In softmax classifier, why use exp function to do normalization?



Why use softmax as opposed to standard normalization? In the comment area of the top answer of this question, @Kilian Batzner raised two questions that also confuse me a lot. It seems no one gives an explanation beyond the numerical benefits.

I get the reasons for using cross-entropy loss, but how does that relate to the softmax? You said "the softmax function can be seen as trying to minimize the cross-entropy between the predictions and the truth". Suppose I used standard / linear normalization, but still used the cross-entropy loss. Then I would also be minimizing the cross-entropy. So how is the softmax linked to the cross-entropy, apart from the numerical benefits?

As for the probabilistic view: what is the motivation for looking at log probabilities? The reasoning seems to be a bit like "We use e^x in the softmax because we interpret x as log-probabilities". With the same reasoning we could say we use e^e^e^x in the softmax because we interpret x as log-log-log-probabilities (exaggerating here, of course). I get the numerical benefits of softmax, but what is the theoretical motivation for using it?

– Hans, asked Sep 20 '17 at 5:53 (tags: machine-learning, deep-learning)

  • It is differentiable, leads to non-negative results (such as would be necessary for a probability so the cross-entropy can be calculated), and behaves like the max function, which is appropriate in a classification setting. Welcome to the site! – Emre, Sep 20 '17 at 6:09

  • @Emre Thanks! But what does "behaves like max function" mean? Besides, if I have another function that is also differentiable, monotone increasing and leads to non-negative results, can I use it to replace the exp function in the formula? – Hans, Sep 20 '17 at 7:12

  • When you normalize using $max$, the greatest argument gets mapped to 1 while the rest get mapped to zero, owing to the growth of the exponential function. – Emre, Sep 20 '17 at 16:02

3 Answers

Answer (25 votes)

It is more than just numerical. A quick reminder of the softmax:
$$
P(y=j \mid x) = \frac{e^{x_j}}{\sum_{k=1}^{K} e^{x_k}}
$$



where $x$ is an input vector with length equal to the number of classes $K$. The softmax function has three very nice properties: 1. it normalizes your data (outputs a proper probability distribution), 2. it is differentiable, and 3. it uses the exp you mentioned. A few important points:



  1. The loss function is not directly related to softmax. You can use standard normalization and still use cross-entropy.


  2. A "hardmax" function (i.e. argmax) is not differentiable. The softmax gives at least a minimal amount of probability to all elements in the output vector, and so is nicely differentiable, hence the term "soft" in softmax.


  3. Now I get to your question. The $e$ in softmax is the natural exponential function. Before we normalize, we transform $x$ as in the graph of $e^x$:


[Figure: graph of the natural exponential function $e^x$]



If $x$ is 0 then $y=1$; if $x$ is 1, then $y \approx 2.7$; and if $x$ is 2, then $y \approx 7.4$! A huge step! This is what's called a non-linear transformation of our unnormalized log scores. The interesting property of the exponential function, combined with the normalization in the softmax, is that high scores in $x$ become much more probable than low scores.



An example. Say $K=4$, and your log score $x$ is vector $[2, 4, 2, 1]$. The simple argmax function outputs:



$$
[0, 1, 0, 0]
$$



The argmax is the goal, but it's not differentiable and we can't train our model with it :( A simple normalization, which is differentiable, outputs the following probabilities:



$$
[0.2222, 0.4444, 0.2222, 0.1111]
$$



That's really far from the argmax! :( Whereas the softmax outputs:
$$
[0.1025, 0.7573, 0.1025, 0.0377]
$$



That's much closer to the argmax! Because we use the natural exponential, we hugely increase the probability of the biggest score and decrease the probability of the lower scores when compared with standard normalization. Hence the "max" in softmax.
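
A quick numpy sketch (mine, not part of the original answer) that reproduces the three vectors above for the example scores $[2, 4, 2, 1]$:

```python
import numpy as np

x = np.array([2.0, 4.0, 2.0, 1.0])   # unnormalized scores from the example

# Hard argmax as a one-hot vector (the goal, but not differentiable).
hard = np.zeros_like(x)
hard[np.argmax(x)] = 1.0

# Simple linear normalization: divide by the sum.
linear = x / x.sum()

# Softmax: exponentiate first, then normalize.
soft = np.exp(x) / np.exp(x).sum()

print(hard)             # [0. 1. 0. 0.]
print(linear.round(4))  # [0.2222 0.4444 0.2222 0.1111]
print(soft.round(4))    # [0.1025 0.7573 0.1025 0.0377]
```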






– vega, answered Oct 26 '17 at 20:49 (edited Oct 27 '17 at 13:02)

  • @Hans, awesome! If this answered your question, please consider clicking it as answered. – vega, Oct 31 '17 at 18:59

  • Great info. However, instead of using e, what about using a constant, say 3 or 4? Will the outcome be the same? – Cheok Yan Cheng, Nov 23 '17 at 20:45

  • @CheokYanCheng, yes. But e has a nicer derivative ;) – vega, Nov 25 '17 at 23:55

  • I have seen that the result of softmax is typically used as the probabilities of belonging to each class. If the choice of e instead of another constant is arbitrary, it does not make sense to see it in terms of probability, right? – jvalle, Oct 26 '18 at 21:13

  • @vega Sorry, but I still don't see how that answers the question: why not use e^e^e^e^e^x for the very same reasons? Please explain. – Gulzar, Dec 9 '18 at 15:39




Answer (1 vote)

In addition to vega's explanation,



let's define a generic softmax:
$$P(y=j \mid x) = \frac{\psi^{x_j}}{\sum_{k=1}^{K} \psi^{x_k}}$$
where $\psi$ is a constant $\geq 1$.

If $\psi = 1$, then you are pretty far from the argmax, as @vega mentioned.

Now assume $\psi = 100$: you are pretty close to the argmax, but you also get really small numbers for negative values and really big numbers for positive ones. These numbers easily overflow the limits of floating-point arithmetic (for example, the largest numpy float64 is about $10^{308}$). On top of that, even with the choice $\psi = e$, which is much smaller than $100$, frameworks should implement a more numerically stable version of softmax (multiplying both numerator and denominator by a constant $C$), since the results become too small or too large to express at that precision.

So you want to pick a constant big enough to approximate the argmax well, and small enough that these big and small numbers can still be represented in the calculations.

And of course, $e$ also has a pretty nice derivative.
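
A minimal sketch (mine, not from the answer) of the stabilization trick mentioned above: multiplying numerator and denominator by $C = e^{-\max(x)}$ is the same as subtracting the maximum score before exponentiating, which leaves the result unchanged but keeps every exponent non-positive.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: shift by max(x), i.e. multiply the
    # numerator and denominator by C = exp(-max(x)).
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
naive = np.exp(x) / np.exp(x).sum()  # exp(1000) overflows to inf -> all nan
print(naive)                         # [nan nan nan]
print(softmax(x))                    # [0.09003057 0.24472847 0.66524096]
```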






– komunistbakkal (new contributor), answered Mar 23 at 18:58 (edited Mar 23 at 23:32)

Answer (1 vote)

This question is very interesting. I do not know the exact reason, but I think the following reasoning could be used to explain the usage of the exponential function. This post is inspired by statistical mechanics and the principle of maximum entropy.

I will explain this by using an example with $N$ images, which consist of $n_1$ images from the class $\mathcal{C}_1$, $n_2$ images from the class $\mathcal{C}_2$, ..., and $n_K$ images from the class $\mathcal{C}_K$. Then we assume that our neural network was able to apply a nonlinear transform on our images, such that we can assign an 'energy level' $E_k$ to all the classes. We assume that this energy is on a nonlinear scale which allows us to linearly separate the images.

The mean energy $\bar{E}$ is related to the other energies $E_k$ by the following relationship
$$
N\bar{E} = \sum_{k=1}^{K} n_k E_k. \qquad (*)
$$

At the same time, we see that the total number of images can be calculated as the following sum
$$
N = \sum_{k=1}^{K} n_k. \qquad (**)
$$

The main idea of the maximum entropy principle is that the number of images in the corresponding classes is distributed in such a way that the number of possible combinations for a given energy distribution is maximized. To put it more simply, the system will not very likely go into a state in which we only have class $n_1$, and it will also not go into a state in which we have the same number of images in each class. But why is this so? If all the images were in one class, the system would have very low entropy. The second case would also be a very unnatural situation. It is more likely that we will have more images with moderate energy and fewer images with very high and very low energy.

The entropy increases with the number of combinations in which we can split the $N$ images into the $n_1$, $n_2$, ..., $n_K$ image classes with corresponding energy. This number of combinations is given by the multinomial coefficient

$$
\binom{N}{n_1, n_2, \ldots, n_K} = \dfrac{N!}{\prod_{k=1}^{K} n_k!}.
$$

We will try to maximize this number, assuming that we have infinitely many images, $N \to \infty$. But this maximization also has the equality constraints $(*)$ and $(**)$. This type of optimization is called constrained optimization. We can solve this problem analytically by using the method of Lagrange multipliers. We introduce the Lagrange multipliers $\beta$ and $\alpha$ for the equality constraints, and we introduce the Lagrange function $\mathcal{L}\left(n_1, n_2, \ldots, n_K; \alpha, \beta\right)$ (maximizing the logarithm of the number of combinations, which has the same maximizer):

$$
\mathcal{L}\left(n_1, n_2, \ldots, n_K; \alpha, \beta\right) = \ln\dfrac{N!}{\prod_{k=1}^{K} n_k!} + \beta\left[\sum_{k=1}^{K} n_k E_k - N\bar{E}\right] + \alpha\left[N - \sum_{k=1}^{K} n_k\right]
$$

As we assumed $N \to \infty$, we can also assume $n_k \to \infty$ and use the Stirling approximation for the factorial

$$
\ln n! = n \ln n - n + \mathcal{O}(\ln n).
$$

Note that this approximation (the first two terms) is only asymptotic; it does not mean that this approximation converges to $\ln n!$ for $n \to \infty$.

The partial derivative of the Lagrange function with respect to $n_{\tilde{k}}$ results in

$$
\dfrac{\partial \mathcal{L}}{\partial n_{\tilde{k}}} = -\ln n_{\tilde{k}} - 1 - \alpha + \beta E_{\tilde{k}}.
$$

If we set this partial derivative to zero, we find

$$
n_{\tilde{k}} = \dfrac{\exp(\beta E_{\tilde{k}})}{\exp(1+\alpha)}. \qquad (***)
$$

If we put this back into $(**)$, we obtain

$$
\exp(1+\alpha) = \dfrac{1}{N}\sum_{k=1}^{K}\exp(\beta E_k).
$$

If we put this back into $(***)$, we get something that should remind us of the softmax function

$$
n_{\tilde{k}} = \dfrac{\exp(\beta E_{\tilde{k}})}{\dfrac{1}{N}\sum_{k=1}^{K}\exp(\beta E_k)}.
$$

If we define $n_{\tilde{k}}/N$ as the probability $p_{\tilde{k}}$ of class $\mathcal{C}_{\tilde{k}}$, we obtain something that is really similar to the softmax function

$$
p_{\tilde{k}} = \dfrac{\exp(\beta E_{\tilde{k}})}{\sum_{k=1}^{K}\exp(\beta E_k)}.
$$

Hence, this shows us that the softmax function is the function that maximizes the entropy in the distribution of images. From this point, it makes sense to use this as the distribution of images. If we set $\beta E_{\tilde{k}} = \boldsymbol{w}^T_k \boldsymbol{x}$, we exactly get the definition of the softmax function for the $k^\text{th}$ output.
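
As a small numerical sanity check of this claim (my own sketch, with made-up energies $E_k$ and a made-up $\beta$, not part of the answer): among all distributions with the same mean energy, the softmax/Gibbs distribution should have the largest entropy.

```python
import numpy as np

rng = np.random.default_rng(0)

E = np.array([1.0, 2.0, 3.0, 4.0])   # made-up "energy levels" for K = 4 classes
beta = 1.5                           # made-up Lagrange multiplier

# The distribution derived above: p_k proportional to exp(beta * E_k).
p = np.exp(beta * E)
p /= p.sum()

def entropy(q):
    return -np.sum(q * np.log(q))

target_mean = np.dot(p, E)           # the mean-energy constraint (*)

# Orthonormal basis of the two constraint directions (sum q = 1, sum q*E fixed).
u1 = np.ones_like(E) / np.sqrt(len(E))
e_perp = E - np.dot(E, u1) * u1
u2 = e_perp / np.linalg.norm(e_perp)

best_other = -np.inf
for _ in range(10_000):
    d = rng.normal(size=E.shape)
    d -= np.dot(d, u1) * u1          # perturbation keeps sum(q) = 1
    d -= np.dot(d, u2) * u2          # perturbation keeps the mean energy fixed
    q = p + 0.005 * d
    if np.any(q <= 0):
        continue                     # stay inside the probability simplex
    best_other = max(best_other, entropy(q))

print(entropy(p), ">=", best_other)  # the softmax/Gibbs distribution wins
assert entropy(p) >= best_other
```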






    share|improve this answer











    $endgroup$












      Your Answer





      StackExchange.ifUsing("editor", function ()
      return StackExchange.using("mathjaxEditing", function ()
      StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
      StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
      );
      );
      , "mathjax-editing");

      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "557"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: false,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      imageUploader:
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      ,
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );













      draft saved

      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f23159%2fin-softmax-classifier-why-use-exp-function-to-do-normalization%23new-answer', 'question_page');

      );

      Post as a guest















      Required, but never shown

























      3 Answers
      3






      active

      oldest

      votes








      3 Answers
      3






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      25












      $begingroup$

      It is more than just numerical. A quick reminder of the softmax:
      $$
      P(y=j | x) = frace^x_jsum_k=1^K e^x_k
      $$



      Where $x$ is an input vector with length equal to the number of classes $K$. The softmax function has 3 very nice properties: 1. it normalizes your data (outputs a proper probability distribution), 2. is differentiable, and 3. it uses the exp you mentioned. A few important points:



      1. The loss function is not directly related to softmax. You can use standard normalization and still use cross-entropy.


      2. A "hardmax" function (i.e. argmax) is not differentiable. The softmax gives at least a minimal amount of probability to all elements in the output vector, and so is nicely differentiable, hence the term "soft" in softmax.


      3. Now I get to your question. The $e$ in softmax is the natural exponential function. Before we normalize, we transform $x$ as in the graph of $e^x$:


      natural exponential function



      If $x$ is 0 then $y=1$, if $x$ is 1, then $y=2.7$, and if $x$ is 2, now $y=7$! A huge step! This is what's called a non-linear transformation of our unnormalized log scores. The interesting property of the exponential function combined with the normalization in the softmax is that high scores in $x$ become much more probable than low scores.



      An example. Say $K=4$, and your log score $x$ is vector $[2, 4, 2, 1]$. The simple argmax function outputs:



      $$
      [0, 1, 0, 0]
      $$



      The argmax is the goal, but it's not differentiable and we can't train our model with it :( A simple normalization, which is differentiable, outputs the following probabilities:



      $$
      [0.2222, 0.4444, 0.2222, 0.1111]
      $$



      That's really far from the argmax! :( Whereas the softmax outputs:
      $$
      [0.1025, 0.7573, 0.1025, 0.0377]
      $$



      That's much closer to the argmax! Because we use the natural exponential, we hugely increase the probability of the biggest score and decrease the probability of the lower scores when compared with standard normalization. Hence the "max" in softmax.






      share|improve this answer











      $endgroup$












      • $begingroup$
        @Hans, awesome! If this answered your question, please consider clicking it as answered
        $endgroup$
        – vega
        Oct 31 '17 at 18:59






      • 3




        $begingroup$
        Great info. However, instead of using e, what about using a constant say 3, or 4 ? Will the outcome be the same?
        $endgroup$
        – Cheok Yan Cheng
        Nov 23 '17 at 20:45






      • 4




        $begingroup$
        @CheokYanCheng, yes. But e has a nicer derivative ;)
        $endgroup$
        – vega
        Nov 25 '17 at 23:55










      • $begingroup$
        I have seen that the result of softmax is typically used as the probabilities of belonging to each class. If the choice of 'e' instead of other constant is arbitrary, it does not make sense to see it in terms of probability, right?
        $endgroup$
        – jvalle
        Oct 26 '18 at 21:13










      • $begingroup$
        @vega Sorry, but I still don't see how that answers the question: why not use e^e^e^e^e^x for the very same reasons? Please explain
        $endgroup$
        – Gulzar
        Dec 9 '18 at 15:39















      25












      $begingroup$

      It is more than just numerical. A quick reminder of the softmax:
      $$
      P(y=j | x) = frace^x_jsum_k=1^K e^x_k
      $$



      Where $x$ is an input vector with length equal to the number of classes $K$. The softmax function has 3 very nice properties: 1. it normalizes your data (outputs a proper probability distribution), 2. is differentiable, and 3. it uses the exp you mentioned. A few important points:



      1. The loss function is not directly related to softmax. You can use standard normalization and still use cross-entropy.


      2. A "hardmax" function (i.e. argmax) is not differentiable. The softmax gives at least a minimal amount of probability to all elements in the output vector, and so is nicely differentiable, hence the term "soft" in softmax.


      3. Now I get to your question. The $e$ in softmax is the natural exponential function. Before we normalize, we transform $x$ as in the graph of $e^x$:


      natural exponential function



      If $x$ is 0 then $y=1$, if $x$ is 1, then $y=2.7$, and if $x$ is 2, now $y=7$! A huge step! This is what's called a non-linear transformation of our unnormalized log scores. The interesting property of the exponential function combined with the normalization in the softmax is that high scores in $x$ become much more probable than low scores.



      An example. Say $K=4$, and your log score $x$ is vector $[2, 4, 2, 1]$. The simple argmax function outputs:



      $$
      [0, 1, 0, 0]
      $$



      The argmax is the goal, but it's not differentiable and we can't train our model with it :( A simple normalization, which is differentiable, outputs the following probabilities:



      $$
      [0.2222, 0.4444, 0.2222, 0.1111]
      $$



      That's really far from the argmax! :( Whereas the softmax outputs:
      $$
      [0.1025, 0.7573, 0.1025, 0.0377]
      $$



      That's much closer to the argmax! Because we use the natural exponential, we hugely increase the probability of the biggest score and decrease the probability of the lower scores when compared with standard normalization. Hence the "max" in softmax.






      share|improve this answer











      $endgroup$












      • $begingroup$
        @Hans, awesome! If this answered your question, please consider clicking it as answered
        $endgroup$
        – vega
        Oct 31 '17 at 18:59






      • 3




        $begingroup$
        Great info. However, instead of using e, what about using a constant say 3, or 4 ? Will the outcome be the same?
        $endgroup$
        – Cheok Yan Cheng
        Nov 23 '17 at 20:45






      • 4




        $begingroup$
        @CheokYanCheng, yes. But e has a nicer derivative ;)
        $endgroup$
        – vega
        Nov 25 '17 at 23:55










      • $begingroup$
        I have seen that the result of softmax is typically used as the probabilities of belonging to each class. If the choice of 'e' instead of other constant is arbitrary, it does not make sense to see it in terms of probability, right?
        $endgroup$
        – jvalle
        Oct 26 '18 at 21:13










      • $begingroup$
        @vega Sorry, but I still don't see how that answers the question: why not use e^e^e^e^e^x for the very same reasons? Please explain
        $endgroup$
        – Gulzar
        Dec 9 '18 at 15:39













      25












      25








      25





      $begingroup$

      It is more than just numerical. A quick reminder of the softmax:
      $$
      P(y=j | x) = frace^x_jsum_k=1^K e^x_k
      $$



      Where $x$ is an input vector with length equal to the number of classes $K$. The softmax function has 3 very nice properties: 1. it normalizes your data (outputs a proper probability distribution), 2. is differentiable, and 3. it uses the exp you mentioned. A few important points:



      1. The loss function is not directly related to softmax. You can use standard normalization and still use cross-entropy.


      2. A "hardmax" function (i.e. argmax) is not differentiable. The softmax gives at least a minimal amount of probability to all elements in the output vector, and so is nicely differentiable, hence the term "soft" in softmax.


      3. Now I get to your question. The $e$ in softmax is the natural exponential function. Before we normalize, we transform $x$ as in the graph of $e^x$:


      natural exponential function



      If $x$ is 0 then $y=1$, if $x$ is 1, then $y=2.7$, and if $x$ is 2, now $y=7$! A huge step! This is what's called a non-linear transformation of our unnormalized log scores. The interesting property of the exponential function combined with the normalization in the softmax is that high scores in $x$ become much more probable than low scores.



      An example. Say $K=4$, and your log score $x$ is vector $[2, 4, 2, 1]$. The simple argmax function outputs:



      $$
      [0, 1, 0, 0]
      $$



      The argmax is the goal, but it's not differentiable and we can't train our model with it :( A simple normalization, which is differentiable, outputs the following probabilities:



      $$
      [0.2222, 0.4444, 0.2222, 0.1111]
      $$



      That's really far from the argmax! :( Whereas the softmax outputs:
      $$
      [0.1025, 0.7573, 0.1025, 0.0377]
      $$



      That's much closer to the argmax! Because we use the natural exponential, we hugely increase the probability of the biggest score and decrease the probability of the lower scores when compared with standard normalization. Hence the "max" in softmax.






      share|improve this answer











      $endgroup$



      It is more than just numerical. A quick reminder of the softmax:
      $$
      P(y=j | x) = frace^x_jsum_k=1^K e^x_k
      $$



      Where $x$ is an input vector with length equal to the number of classes $K$. The softmax function has 3 very nice properties: 1. it normalizes your data (outputs a proper probability distribution), 2. is differentiable, and 3. it uses the exp you mentioned. A few important points:



      1. The loss function is not directly related to softmax. You can use standard normalization and still use cross-entropy.


      2. A "hardmax" function (i.e. argmax) is not differentiable. The softmax gives at least a minimal amount of probability to all elements in the output vector, and so is nicely differentiable, hence the term "soft" in softmax.


      3. Now I get to your question. The $e$ in softmax is the natural exponential function. Before we normalize, we transform $x$ as in the graph of $e^x$:


      natural exponential function



      If $x$ is 0 then $y=1$, if $x$ is 1, then $y=2.7$, and if $x$ is 2, now $y=7$! A huge step! This is what's called a non-linear transformation of our unnormalized log scores. The interesting property of the exponential function combined with the normalization in the softmax is that high scores in $x$ become much more probable than low scores.



      An example. Say $K=4$, and your log score $x$ is vector $[2, 4, 2, 1]$. The simple argmax function outputs:



      $$
      [0, 1, 0, 0]
      $$



      The argmax is the goal, but it's not differentiable and we can't train our model with it :( A simple normalization, which is differentiable, outputs the following probabilities:



      $$
      [0.2222, 0.4444, 0.2222, 0.1111]
      $$



      That's really far from the argmax! :( Whereas the softmax outputs:
      $$
      [0.1025, 0.7573, 0.1025, 0.0377]
      $$



      That's much closer to the argmax! Because we use the natural exponential, we hugely increase the probability of the biggest score and decrease the probability of the lower scores when compared with standard normalization. Hence the "max" in softmax.







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Oct 27 '17 at 13:02

























      answered Oct 26 '17 at 20:49









      vegavega

      36635




      36635











      • $begingroup$
        @Hans, awesome! If this answered your question, please consider clicking it as answered
        $endgroup$
        – vega
        Oct 31 '17 at 18:59






      • 3




        $begingroup$
        Great info. However, instead of using e, what about using a constant say 3, or 4 ? Will the outcome be the same?
        $endgroup$
        – Cheok Yan Cheng
        Nov 23 '17 at 20:45






      • 4




        $begingroup$
        @CheokYanCheng, yes. But e has a nicer derivative ;)
        $endgroup$
        – vega
        Nov 25 '17 at 23:55










      • $begingroup$
        I have seen that the result of softmax is typically used as the probabilities of belonging to each class. If the choice of 'e' instead of other constant is arbitrary, it does not make sense to see it in terms of probability, right?
        $endgroup$
        – jvalle
        Oct 26 '18 at 21:13










      • $begingroup$
        @vega Sorry, but I still don't see how that answers the question: why not use e^e^e^e^e^x for the very same reasons? Please explain
        $endgroup$
        – Gulzar
        Dec 9 '18 at 15:39
















      • $begingroup$
        @Hans, awesome! If this answered your question, please consider clicking it as answered
        $endgroup$
        – vega
        Oct 31 '17 at 18:59






      • 3




        $begingroup$
        Great info. However, instead of using e, what about using a constant say 3, or 4 ? Will the outcome be the same?
        $endgroup$
        – Cheok Yan Cheng
        Nov 23 '17 at 20:45






      • 4




        $begingroup$
        @CheokYanCheng, yes. But e has a nicer derivative ;)
        $endgroup$
        – vega
        Nov 25 '17 at 23:55










      • $begingroup$
        I have seen that the result of softmax is typically used as the probabilities of belonging to each class. If the choice of 'e' instead of other constant is arbitrary, it does not make sense to see it in terms of probability, right?
        $endgroup$
        – jvalle
        Oct 26 '18 at 21:13










      • $begingroup$
        @vega Sorry, but I still don't see how that answers the question: why not use e^e^e^e^e^x for the very same reasons? Please explain
        $endgroup$
        – Gulzar
        Dec 9 '18 at 15:39















      $begingroup$
      @Hans, awesome! If this answered your question, please consider clicking it as answered
      $endgroup$
      – vega
      Oct 31 '17 at 18:59




      $begingroup$
      @Hans, awesome! If this answered your question, please consider clicking it as answered
      $endgroup$
      – vega
      Oct 31 '17 at 18:59




      3




      3




      $begingroup$
      Great info. However, instead of using e, what about using a constant say 3, or 4 ? Will the outcome be the same?
      $endgroup$
      – Cheok Yan Cheng
      Nov 23 '17 at 20:45




      $begingroup$
      Great info. However, instead of using e, what about using a constant say 3, or 4 ? Will the outcome be the same?
      $endgroup$
      – Cheok Yan Cheng
      Nov 23 '17 at 20:45




      4




      4




      $begingroup$
      @CheokYanCheng, yes. But e has a nicer derivative ;)
      $endgroup$
      – vega
      Nov 25 '17 at 23:55




      $begingroup$
      @CheokYanCheng, yes. But e has a nicer derivative ;)
      $endgroup$
      – vega
      Nov 25 '17 at 23:55












      $begingroup$
      I have seen that the result of softmax is typically used as the probabilities of belonging to each class. If the choice of 'e' instead of other constant is arbitrary, it does not make sense to see it in terms of probability, right?
      $endgroup$
      – jvalle
      Oct 26 '18 at 21:13




      $begingroup$
      I have seen that the result of softmax is typically used as the probabilities of belonging to each class. If the choice of 'e' instead of other constant is arbitrary, it does not make sense to see it in terms of probability, right?
      $endgroup$
      – jvalle
      Oct 26 '18 at 21:13












      $begingroup$
      @vega Sorry, but I still don't see how that answers the question: why not use e^e^e^e^e^x for the very same reasons? Please explain
      $endgroup$
      – Gulzar
      Dec 9 '18 at 15:39




      $begingroup$
      @vega Sorry, but I still don't see how that answers the question: why not use e^e^e^e^e^x for the very same reasons? Please explain
      $endgroup$
      – Gulzar
      Dec 9 '18 at 15:39











      1












      $begingroup$

      In addition to vega's explanation,



      let's define generic softmax:
      $$P(y=j | x) = fracpsi^x_jsum_k=1^K psi^x_k$$
      where $psi$ is a constant >= 1



      if $psi=1$, then you are pretty far from argmax as @vega mentioned.



      Let's now assume $psi=100$, now you are pretty close to the argmax but you also have a really small numbers for negative values and big numbers for positives. This numbers overflows the float point arithmetic limit easily(for example maximum limit of numpy float64 is $10^308$). In addition to that, even if the selection is $psi=e$ which is much smaller than $100$, frameworks should implement a more stable version of softmax (multiplying both numerator and denominator with constant $C$) since results become to small to be able to express with such precision.



      So, you want to pick a constant big enough to approximate argmax well, and also small enough to express these big and small numbers in calculations.



      And of course, $e$ also has pretty nice derivative.






      share|improve this answer










      New contributor




      komunistbakkal is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.






      $endgroup$

















        1












        $begingroup$

        In addition to vega's explanation,



        let's define generic softmax:
        $$P(y=j | x) = fracpsi^x_jsum_k=1^K psi^x_k$$
        where $psi$ is a constant >= 1



        if $psi=1$, then you are pretty far from argmax as @vega mentioned.



        Let's now assume $psi=100$, now you are pretty close to the argmax but you also have a really small numbers for negative values and big numbers for positives. This numbers overflows the float point arithmetic limit easily(for example maximum limit of numpy float64 is $10^308$). In addition to that, even if the selection is $psi=e$ which is much smaller than $100$, frameworks should implement a more stable version of softmax (multiplying both numerator and denominator with constant $C$) since results become to small to be able to express with such precision.



        So, you want to pick a constant big enough to approximate argmax well, and also small enough to express these big and small numbers in calculations.



        And of course, $e$ also has pretty nice derivative.






        share|improve this answer










        New contributor




        komunistbakkal is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
        Check out our Code of Conduct.






        $endgroup$















          1












          1








          1





          $begingroup$

          In addition to vega's explanation,



          let's define generic softmax:
          $$P(y=j | x) = fracpsi^x_jsum_k=1^K psi^x_k$$
          where $psi$ is a constant >= 1



          if $psi=1$, then you are pretty far from argmax as @vega mentioned.



          Let's now assume $psi=100$, now you are pretty close to the argmax but you also have a really small numbers for negative values and big numbers for positives. This numbers overflows the float point arithmetic limit easily(for example maximum limit of numpy float64 is $10^308$). In addition to that, even if the selection is $psi=e$ which is much smaller than $100$, frameworks should implement a more stable version of softmax (multiplying both numerator and denominator with constant $C$) since results become to small to be able to express with such precision.



          So, you want to pick a constant big enough to approximate argmax well, and also small enough to express these big and small numbers in calculations.



          And of course, $e$ also has pretty nice derivative.






          share|improve this answer










          New contributor




          komunistbakkal is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
          Check out our Code of Conduct.






          $endgroup$



          In addition to vega's explanation,



          let's define generic softmax:
          $$P(y=j | x) = fracpsi^x_jsum_k=1^K psi^x_k$$
          where $psi$ is a constant >= 1



          if $psi=1$, then you are pretty far from argmax as @vega mentioned.



          Let's now assume $psi=100$, now you are pretty close to the argmax but you also have a really small numbers for negative values and big numbers for positives. This numbers overflows the float point arithmetic limit easily(for example maximum limit of numpy float64 is $10^308$). In addition to that, even if the selection is $psi=e$ which is much smaller than $100$, frameworks should implement a more stable version of softmax (multiplying both numerator and denominator with constant $C$) since results become to small to be able to express with such precision.



          So, you want to pick a constant big enough to approximate argmax well, and also small enough to express these big and small numbers in calculations.



          And of course, $e$ also has pretty nice derivative.







          share|improve this answer










          New contributor




          komunistbakkal is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
          Check out our Code of Conduct.









          share|improve this answer



          share|improve this answer








          edited Mar 23 at 23:32





















          New contributor




          komunistbakkal is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
          Check out our Code of Conduct.









          answered Mar 23 at 18:58









          komunistbakkalkomunistbakkal

          112




          112




          New contributor




          komunistbakkal is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
          Check out our Code of Conduct.





          New contributor





          komunistbakkal is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
          Check out our Code of Conduct.






          komunistbakkal is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
          Check out our Code of Conduct.





















              1












              $begingroup$

              This question is very interesting. I do not know the exact reason but I think the following reason could be used to explain the usage of the exponential function. This post is inspired by statistical mechanics and the principle of maximum entropy.



              I will explain this by using an example with $N$ images, which are constituted by $n_1$ images from the class $mathcalC_1$, $n_2$ images from the class $mathcalC_2$, ..., and $n_K$ images from the class $mathcalC_K$. Then we assume that our neural network was able to apply a nonlinear transform on our images, such that we can assign an 'energy level' $E_k$ to all the classes. We assume that this energy is on a nonlinear scale which allows us to linearly separate the images.



              The mean energy $barE$ is related to the other energies $E_k$ by the following relationship
              beginequation
              NbarE = sum_k=1^K n_k E_k.qquad (*)
              labeleq:mean_energy
              endequation



              At the same time, we see that the total amount of images can be calculated as the following sum



              beginequation
              N = sum_k=1^Kn_k.qquad (**)
              labeleq:conservation_of_particles
              endequation



              The main idea of the maximum entropy principle is that the number of the images in the corresponding classes is distributed in such a way that that the number of possible combinations of for a given energy distribution is maximized. To put it more simply the system will not very likeli go into a state in which we only have class $n_1$ it will also not go into a state in which we have the same number of images in each class. But why is this so? If all the images were in one class the system would have very low entropy. The second case would also be a very unnatural situation. It is more likely that we will have more images with moderate energy and fewer images with very high and very low energy.



              The entropy increases with the number of combinations in which we can split the $N$ images into the $n_1$, $n_2$, ..., $n_K$ image classes with corresponding energy. This number of combinations is given by the multinomial coefficient



              beginequation
              beginpmatrix
              N!\
              n_1!,n_2!,ldots,n_K!\
              endpmatrix=dfracN!prod_k=1^K n_k!.
              endequation



              We will try to maximize this number assuming that we have infinitely many images $Nto infty$. But his maximization has also equality constraints $(*)$ and $(**)$. This type of optimization is called constrained optimization. We can solve this problem analytically by using the method of Lagrange multipliers. We introduce the Lagrange multipliers $beta$ and $alpha$ for the equality constraints and we introduce the Lagrange Funktion $mathcalLleft(n_1,n_2,ldots,n_k;alpha, beta right)$.



              beginequation
              mathcalLleft(n_1,n_2,ldots,n_k;alpha, beta right) = dfracN!prod_k=1^Kn_k!+betaleft[sum_k=1^Kn_k E_k - NbarEright]+alphaleft[N-sum_k=1^K n_kright]
              endequation



              As we assumed $Nto infty$ we can also assume $n_k to infty$ and use the Stirling approximation for the factorial



              beginequation
              ln n! = nln n - n + mathcalO(ln n).
              endequation



              Note that this approximation (the first two terms) is only asymptotic it does not mean that this approximation will converge to $ln n!$ for $nto infty$.



              The partial derivative of the Lagrange function with respect $n_tildek$ will result in



              $$dfracpartial mathcalLpartial n_tildek=-ln n_tildek-1-alpha+beta E_tildek.$$



              If we set this partial derivative to zero we can find



              $$n_tildek=dfracexp(beta E_tildek)exp(1+alpha). qquad (***)$$



              If we put this back into $(**)$ we can obtain



              $$exp(1+alpha)=dfrac1Nsum_k=1^Kexp(beta E_k).$$



              If we put this back into $(***)$ we get something that should remind us of the softmax function



              $$n_tildek=dfracexp(beta E_tildek)dfrac1Nsum_k=1^Kexp(beta E_k).$$



              If we define $n_tildek/N$ as the probability of class $mathcalC_tildek$ by $p_tildek$ we will obtain something that is really similar to the softmax function



              $$p_tildek=dfracexp(beta E_tildek)sum_k=1^Kexp(beta E_k).$$



              Hence, this shows us that the softmax function is the function that is maximizing the entropy in the distribution of images. From this point, it makes sense to use this as the distribution of images. If we set $beta E_tildek=boldsymbolw^T_kboldsymbolx$ we exactly get the definition of the softmax function for the $k^textth$ output.






              share|improve this answer











              $endgroup$

















                1












                $begingroup$

                This question is very interesting. I do not know the exact reason but I think the following reason could be used to explain the usage of the exponential function. This post is inspired by statistical mechanics and the principle of maximum entropy.



                I will explain this by using an example with $N$ images, which are constituted by $n_1$ images from the class $mathcalC_1$, $n_2$ images from the class $mathcalC_2$, ..., and $n_K$ images from the class $mathcalC_K$. Then we assume that our neural network was able to apply a nonlinear transform on our images, such that we can assign an 'energy level' $E_k$ to all the classes. We assume that this energy is on a nonlinear scale which allows us to linearly separate the images.



                The mean energy $barE$ is related to the other energies $E_k$ by the following relationship
                beginequation
                NbarE = sum_k=1^K n_k E_k.qquad (*)
                labeleq:mean_energy
                endequation



                At the same time, we see that the total amount of images can be calculated as the following sum



                beginequation
                N = sum_k=1^Kn_k.qquad (**)
                labeleq:conservation_of_particles
                endequation



                The main idea of the maximum entropy principle is that the number of the images in the corresponding classes is distributed in such a way that that the number of possible combinations of for a given energy distribution is maximized. To put it more simply the system will not very likeli go into a state in which we only have class $n_1$ it will also not go into a state in which we have the same number of images in each class. But why is this so? If all the images were in one class the system would have very low entropy. The second case would also be a very unnatural situation. It is more likely that we will have more images with moderate energy and fewer images with very high and very low energy.



                The entropy increases with the number of combinations in which we can split the $N$ images into the $n_1$, $n_2$, ..., $n_K$ image classes with corresponding energy. This number of combinations is given by the multinomial coefficient



                beginequation
                beginpmatrix
                N!\
                n_1!,n_2!,ldots,n_K!\
                endpmatrix=dfracN!prod_k=1^K n_k!.
                endequation



                We will try to maximize this number assuming that we have infinitely many images $Nto infty$. But his maximization has also equality constraints $(*)$ and $(**)$. This type of optimization is called constrained optimization. We can solve this problem analytically by using the method of Lagrange multipliers. We introduce the Lagrange multipliers $beta$ and $alpha$ for the equality constraints and we introduce the Lagrange Funktion $mathcalLleft(n_1,n_2,ldots,n_k;alpha, beta right)$.



                beginequation
                mathcalLleft(n_1,n_2,ldots,n_k;alpha, beta right) = dfracN!prod_k=1^Kn_k!+betaleft[sum_k=1^Kn_k E_k - NbarEright]+alphaleft[N-sum_k=1^K n_kright]
                endequation



                As we assumed $Nto infty$ we can also assume $n_k to infty$ and use the Stirling approximation for the factorial



                beginequation
                ln n! = nln n - n + mathcalO(ln n).
                endequation



                Note that this approximation (the first two terms) is only asymptotic it does not mean that this approximation will converge to $ln n!$ for $nto infty$.



                The partial derivative of the Lagrange function with respect $n_tildek$ will result in



                $$dfracpartial mathcalLpartial n_tildek=-ln n_tildek-1-alpha+beta E_tildek.$$



                If we set this partial derivative to zero we can find



                $$n_tildek=dfracexp(beta E_tildek)exp(1+alpha). qquad (***)$$



                If we put this back into $(**)$ we can obtain



                $$exp(1+alpha)=dfrac1Nsum_k=1^Kexp(beta E_k).$$



                If we put this back into $(***)$ we get something that should remind us of the softmax function



                $$n_tildek=dfracexp(beta E_tildek)dfrac1Nsum_k=1^Kexp(beta E_k).$$



                If we define $n_tildek/N$ as the probability of class $mathcalC_tildek$ by $p_tildek$ we will obtain something that is really similar to the softmax function



                $$p_tildek=dfracexp(beta E_tildek)sum_k=1^Kexp(beta E_k).$$



                Hence, this shows us that the softmax function is the function that is maximizing the entropy in the distribution of images. From this point, it makes sense to use this as the distribution of images. If we set $beta E_tildek=boldsymbolw^T_kboldsymbolx$ we exactly get the definition of the softmax function for the $k^textth$ output.






                  edited Mar 24 at 11:10

























                  answered Mar 24 at 8:49









MachineLearner

