Effects of L2 loss and smooth L1 loss




Can anyone tell me what the effects of $L_2$ loss and smooth $L_1$ loss (i.e. Huber loss with $\alpha = 1$) are, and when should each of them be used?

loss-function

asked Apr 3 at 4:29 by HOANG GIANG, edited Apr 3 at 11:33 by bradS
1 Answer
First, Huber loss only works in one dimension, since it requires
$$\left\|\boldsymbol{a}\right\|_2=\left\|\boldsymbol{a}\right\|_1=\delta$$
at the intersection of its two pieces, which only holds in one dimension. The norms $L_2$ and $L_1$ are defined for vectors. Therefore, in my opinion, Huber loss is better compared with squared loss than with $L_2$ loss, since "$L_2$" presumes a multi-dimensional input, whereas "squared" does not.



Huber loss is the same as squared loss for differences smaller than $\delta$, and the same as absolute loss for differences larger than $\delta$, i.e.
$$\begin{align*}
L_\delta(y_n, f_\theta(\boldsymbol{x}_n))
=\left\{
\begin{matrix}
\frac{1}{2}\left(y_n - f_\theta(\boldsymbol{x}_n)\right)^2, & \left|y_n - f_\theta(\boldsymbol{x}_n)\right| \leq \delta,\\
\delta\left|y_n - f_\theta(\boldsymbol{x}_n)\right| - \frac{1}{2}\delta^2, & \text{otherwise,}
\end{matrix}
\right.
\end{align*}$$



where $y_n$ is the target of data point $n$, and $f_\theta(\boldsymbol{x}_n)$ is the model's prediction. Note that $L_\delta$ has nothing to do with the $L_p$ norm, despite the similar notation.
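As an illustration of the piecewise definition above, here is a minimal NumPy sketch (the helper names `squared_loss` and `huber_loss` are my own, not from any library):

```python
import numpy as np

def squared_loss(y, y_pred):
    """Elementwise squared loss: 0.5 * (y - y_pred)^2."""
    return 0.5 * (y - y_pred) ** 2

def huber_loss(y, y_pred, delta=1.0):
    """Elementwise Huber loss: quadratic for |residual| <= delta, linear beyond."""
    residual = np.abs(y - y_pred)
    quadratic = 0.5 * residual ** 2
    linear = delta * residual - 0.5 * delta ** 2
    return np.where(residual <= delta, quadratic, linear)

y      = np.array([2.0, 10.0, 20.0, 10.0])   # last target is an "outlier" w.r.t. the trend
y_pred = np.array([2.0, 10.0, 20.0,  2.0])   # model follows the linear trend f(x) = 2x

print(squared_loss(y, y_pred))  # outlier contributes 0.5 * 8^2 = 32
print(huber_loss(y, y_pred))    # outlier contributes 1 * 8 - 0.5 = 7.5
```

With $\delta = 1$ (smooth $L_1$), the outlier's contribution grows only linearly with the residual instead of quadratically.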



Because of this definition, for large differences caused by outliers, the gradient of the loss remains constant at $\pm \delta$, the same as for absolute loss, i.e.
$$\frac{\partial L_\delta}{\partial \theta_i} = \pm \delta \, \frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i},$$
compared to squared loss, where the gradient grows with the difference, i.e.
$$\frac{\partial}{\partial \theta_i}\,\frac{1}{2}\left(y_n - f_\theta(\boldsymbol{x}_n)\right)^2 = -\left(y_n - f_\theta(\boldsymbol{x}_n)\right)\frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i},$$



which leads to large contributions from outliers when we update a parameter based solely on squared loss:
$$\begin{align*}
\theta'_i &= \theta_i + \lambda \sum_n \frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i}\left(y_n - f_\theta(\boldsymbol{x}_n)\right) \\
&= \theta_i + \lambda\sum_{n \notin \text{outliers}} \frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i}(\text{small}) + \lambda\sum_{n \in \text{outliers}} \frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i}(\text{large}).
\end{align*}$$
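To make the gradient behaviour concrete, here is a small sketch (my own illustration, not part of the original answer) comparing the per-sample gradient factor $\partial L / \partial f$ for squared and Huber loss; under Huber loss this factor is clipped to $[-\delta, \delta]$, so a single outlier cannot dominate the update:

```python
import numpy as np

def squared_loss_grad(y, y_pred):
    """dL/df for squared loss 0.5*(y - f)^2: grows linearly with the residual."""
    return -(y - y_pred)

def huber_loss_grad(y, y_pred, delta=1.0):
    """dL/df for Huber loss: the squared-loss gradient clipped to [-delta, delta]."""
    return np.clip(-(y - y_pred), -delta, delta)

y      = np.array([2.0, 10.0, 20.0, 10.0])   # last point is an outlier w.r.t. the trend
y_pred = np.array([2.0, 10.0, 20.0,  2.0])

print(squared_loss_grad(y, y_pred))  # [ 0.  0.  0. -8.]  -> outlier dominates the sum
print(huber_loss_grad(y, y_pred))    # [ 0.  0.  0. -1.]  -> outlier's influence capped at delta
```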



It is worth noting that, here, outliers are irregularities in the joint input-output space $(\boldsymbol{x}_n, y_n)$, not necessarily in the input space $\boldsymbol{x}_n$ alone, as we usually visualize them in unsupervised tasks. For example, on a linear trend, none of $(x, y)=(1, 2)$, $(5, 10)$, $(10, 20)$ are outliers, but $(1, 10)$ is, which leads to a large difference $(10 - 2)$ when the model expects (predicts) $f_\theta(1)=2$.




          When to use each of them?




Keeping in mind that we are only talking about one-dimensional targets, Huber loss is a complete replacement for squared loss when dealing with outliers. However, the challenge is the choice of $\delta$, which makes it a less favorable first choice when we are not yet familiar with the problem at hand. Therefore, we may start with squared loss (or other losses), and later experiment with Huber loss for different values of $\delta$.
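As an aside, common frameworks already expose this loss with a tunable threshold (for example `tf.keras.losses.Huber(delta=...)` in Keras and `torch.nn.SmoothL1Loss(beta=...)` in PyTorch). A minimal sketch of experimenting with $\delta$ using a plain NumPy helper (my own illustration, with made-up validation data) could look like:

```python
import numpy as np

def huber_loss(y, y_pred, delta):
    """Mean Huber loss over a batch (same piecewise form as above)."""
    residual = np.abs(y - y_pred)
    return np.mean(np.where(residual <= delta,
                            0.5 * residual ** 2,
                            delta * residual - 0.5 * delta ** 2))

# Hypothetical validation targets and predictions, including one outlier.
y      = np.array([2.0, 10.0, 20.0, 10.0])
y_pred = np.array([2.1,  9.8, 20.3,  2.0])

for delta in (0.5, 1.0, 2.0, 5.0):
    print(f"delta={delta}: loss={huber_loss(y, y_pred, delta):.3f}")
```

As $\delta$ grows, the loss approaches the squared loss and the outlier's contribution starts to dominate again.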






answered Apr 3 at 7:38 by Esmailian, edited Apr 3 at 9:03