In softmax classifier, why use exp function to do normalization?
In the comment area of the top answer to the question "Why use softmax as opposed to standard normalization?", @Kilian Batzner raised two questions that also confuse me a lot. It seems no one gives an explanation beyond numerical benefits:
I get the reasons for using cross-entropy loss, but how does that relate to the softmax? You said "the softmax function can be seen as trying to minimize the cross-entropy between the predictions and the truth". Suppose I used standard (linear) normalization but still used the cross-entropy loss; then I would also be trying to minimize the cross-entropy. So how is the softmax linked to the cross-entropy, apart from the numerical benefits?
As for the probabilistic view: what is the motivation for looking at log probabilities? The reasoning seems to be "we use e^x in the softmax because we interpret x as log-probabilities". By the same reasoning we could say we use e^e^e^x in the softmax because we interpret x as log-log-log-probabilities (exaggerating, of course). I get the numerical benefits of softmax, but what is the theoretical motivation for using it?
machine-learning deep-learning
asked Sep 20 '17 at 5:53 by Hans
It is differentiable, leads to non-negative results (such as would be necessary for a probability so the cross-entropy can be calculated), and behaves like the max function, which is appropriate in a classification setting. Welcome to the site!
– Emre
Sep 20 '17 at 6:09
@Emre Thanks! But what does "behaves like max function" mean? Besides, if I have another function that is also differentiable, monotone increasing and leads to non-negative results, can I use it to replace the exp function in the formula?
– Hans
Sep 20 '17 at 7:12
When you normalize after exponentiating, the greatest argument gets mapped close to 1 while the rest get mapped close to zero, owing to the growth of the exponential function.
– Emre
Sep 20 '17 at 16:02
3 Answers
It is more than just numerical. A quick reminder of the softmax:
$$
P(y=j \mid x) = \frac{e^{x_j}}{\sum_{k=1}^{K} e^{x_k}}
$$
where $x$ is an input vector whose length equals the number of classes $K$. The softmax function has three very nice properties: 1. it normalizes your data (outputs a proper probability distribution), 2. it is differentiable, and 3. it uses the exp you mentioned. A few important points:
The loss function is not directly related to softmax. You can use standard normalization and still use cross-entropy.
A "hardmax" function (i.e. argmax) is not differentiable. The softmax gives at least a minimal amount of probability to all elements in the output vector, and so is nicely differentiable, hence the term "soft" in softmax.
Now I get to your question. The $e$ in softmax is the natural exponential function. Before we normalize, we transform $x$ with $e^x$ (picture the graph of the exponential):
If $x$ is 0 then $y=1$; if $x$ is 1 then $y \approx 2.7$; and if $x$ is 2, then $y \approx 7.4$. A huge step! This is what's called a non-linear transformation of our unnormalized log scores. The interesting property of the exponential function, combined with the normalization in the softmax, is that high scores in $x$ become much more probable than low scores.
An example. Say $K=4$, and your log score $x$ is vector $[2, 4, 2, 1]$. The simple argmax function outputs:
$$
[0, 1, 0, 0]
$$
The argmax is the goal, but it's not differentiable and we can't train our model with it :( A simple normalization, which is differentiable, outputs the following probabilities:
$$
[0.2222, 0.4444, 0.2222, 0.1111]
$$
That's really far from the argmax! :( Whereas the softmax outputs:
$$
[0.1025, 0.7573, 0.1025, 0.0377]
$$
That's much closer to the argmax! Because we use the natural exponential, we hugely increase the probability of the biggest score and decrease the probability of the lower scores when compared with standard normalization. Hence the "max" in softmax.
– vega, answered Oct 26 '17 at 20:49, edited Oct 27 '17 at 13:02
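A quick NumPy sketch (an editorial addition, not part of the original answer) that reproduces the three vectors in this example:

```python
import numpy as np

x = np.array([2.0, 4.0, 2.0, 1.0])   # the unnormalized log scores from the example above

# Hard argmax as a one-hot vector: the goal, but not differentiable.
hardmax = np.zeros_like(x)
hardmax[np.argmax(x)] = 1.0

# Simple linear normalization: divide each score by the sum of the scores.
linear = x / x.sum()

# Softmax: exponentiate first, then normalize.
softmax = np.exp(x) / np.exp(x).sum()

print(hardmax)                  # [0. 1. 0. 0.]
print(np.round(linear, 4))      # [0.2222 0.4444 0.2222 0.1111]
print(np.round(softmax, 4))     # [0.1025 0.7573 0.1025 0.0377]
```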
@Hans, awesome! If this answered your question, please consider clicking it as answered
– vega
Oct 31 '17 at 18:59
Great info. However, instead of using e, what about using a constant, say 3 or 4? Will the outcome be the same?
– Cheok Yan Cheng
Nov 23 '17 at 20:45
@CheokYanCheng, yes. But e has a nicer derivative ;)
– vega
Nov 25 '17 at 23:55
I have seen that the result of softmax is typically used as the probabilities of belonging to each class. If the choice of 'e' instead of another constant is arbitrary, it does not make sense to see it in terms of probability, right?
– jvalle
Oct 26 '18 at 21:13
@vega Sorry, but I still don't see how that answers the question: why not use e^e^e^e^e^x for the very same reasons? Please explain
– Gulzar
Dec 9 '18 at 15:39
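On the base question raised in the comments above: any base $b > 1$ gives a softmax over rescaled scores, since $b^x = e^{x \ln b}$, so changing the base only changes the effective temperature. A small sketch of my own (not from the thread), assuming NumPy:

```python
import numpy as np

def softmax_base(x, b=np.e):
    # Softmax with an arbitrary base b > 1. Because b**x == exp(x * ln b),
    # this is the ordinary softmax applied to the rescaled scores x * ln(b).
    z = b ** (x - x.max())        # shifting by the max leaves the ratios unchanged
    return z / z.sum()

x = np.array([2.0, 4.0, 2.0, 1.0])
print(np.round(softmax_base(x), 4))          # base e: [0.1025 0.7573 0.1025 0.0377]
print(np.round(softmax_base(x, b=3.0), 4))   # base 3: slightly sharper, since ln(3) > 1

# Equivalent to the usual softmax on the rescaled scores x * ln(3):
y = np.exp(x * np.log(3.0) - np.max(x * np.log(3.0)))
print(np.round(y / y.sum(), 4))              # same as the base-3 result above
```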
In addition to vega's explanation,
let's define a generic softmax:
$$P(y=j \mid x) = \frac{\psi^{x_j}}{\sum_{k=1}^{K} \psi^{x_k}}$$
where $\psi$ is a constant $\geq 1$.
If $\psi=1$, then you are pretty far from the argmax, as @vega mentioned.
Now assume $\psi=100$: you are pretty close to the argmax, but you also get really small numbers for negative scores and really big numbers for positive ones. These numbers easily overflow the limits of floating-point arithmetic (for example, the maximum of NumPy's float64 is about $10^{308}$). On top of that, even with the choice $\psi=e$, which is much smaller than $100$, frameworks should implement a more numerically stable version of softmax (multiplying both numerator and denominator by a constant $C$), since the results can become too small or too large to represent at that precision.
So you want to pick a base big enough to approximate the argmax well, and also small enough that these big and small numbers can still be represented in the calculations.
And of course, $e$ also has a pretty nice derivative.
– komunistbakkal (new contributor), answered Mar 23 at 18:58, edited Mar 23 at 23:32
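A minimal sketch of the stability trick mentioned above (an editorial illustration, assuming NumPy): multiplying numerator and denominator by $C = e^{-\max_k x_k}$, i.e. subtracting the maximum score before exponentiating, leaves the quotient unchanged but avoids overflow.

```python
import numpy as np

def softmax_naive(x):
    z = np.exp(x)                 # overflows to inf for large scores
    return z / z.sum()

def softmax_stable(x):
    # Multiply numerator and denominator by C = exp(-max(x)),
    # which is the same as subtracting max(x) before exponentiating.
    z = np.exp(x - x.max())
    return z / z.sum()

x = np.array([1000.0, 1010.0, 990.0])   # large unnormalized scores
print(softmax_naive(x))    # emits an overflow warning and returns [nan nan nan]
print(softmax_stable(x))   # roughly [4.5e-05 1.0e+00 2.1e-09]
```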
This question is very interesting. I do not know the exact reason, but I think the following argument, inspired by statistical mechanics and the principle of maximum entropy, could explain the use of the exponential function.
I will explain this using an example with $N$ images, consisting of $n_1$ images from class $\mathcal{C}_1$, $n_2$ images from class $\mathcal{C}_2$, ..., and $n_K$ images from class $\mathcal{C}_K$. We assume that our neural network can apply a nonlinear transform to the images such that we can assign an 'energy level' $E_k$ to each class. We assume that this energy is on a nonlinear scale which allows us to linearly separate the images.
The mean energy $\bar{E}$ is related to the class energies $E_k$ by the relationship
\begin{equation}
N\bar{E} = \sum_{k=1}^{K} n_k E_k. \qquad (*)
\end{equation}
At the same time, the total number of images is given by the sum
\begin{equation}
N = \sum_{k=1}^{K} n_k. \qquad (**)
\end{equation}
The main idea of the maximum entropy principle is that the images are distributed over the classes in such a way that the number of possible combinations for a given energy distribution is maximized. Put more simply, the system is very unlikely to go into a state in which all images belong to class $\mathcal{C}_1$, and it will also not go into a state in which every class contains the same number of images. Why is this so? If all the images were in one class, the system would have very low entropy; equal occupation of every class is also a very unnatural situation. It is more likely that we will have many images with moderate energy and fewer images with very high and very low energy.
The entropy increases with the number of ways in which we can split the $N$ images into classes of sizes $n_1, n_2, \ldots, n_K$ with the corresponding energies. This number of combinations is given by the multinomial coefficient
\begin{equation}
\binom{N}{n_1, n_2, \ldots, n_K} = \dfrac{N!}{\prod_{k=1}^{K} n_k!}.
\end{equation}
We will try to maximize this number, assuming that we have infinitely many images ($N \to \infty$). Since the logarithm is monotone, maximizing this number is equivalent to maximizing its logarithm. But this maximization is subject to the equality constraints $(*)$ and $(**)$. This type of problem is called constrained optimization, and we can solve it analytically with the method of Lagrange multipliers. We introduce the Lagrange multipliers $\beta$ and $\alpha$ for the equality constraints and form the Lagrange function $\mathcal{L}\left(n_1,n_2,\ldots,n_K;\alpha,\beta\right)$:
\begin{equation}
\mathcal{L}\left(n_1,n_2,\ldots,n_K;\alpha,\beta\right) = \ln\!\left(\dfrac{N!}{\prod_{k=1}^{K} n_k!}\right) + \beta\left[\sum_{k=1}^{K} n_k E_k - N\bar{E}\right] + \alpha\left[N - \sum_{k=1}^{K} n_k\right]
\end{equation}
As we assumed $N \to \infty$, we can also assume $n_k \to \infty$ and use the Stirling approximation for the factorial,
\begin{equation}
\ln n! = n\ln n - n + \mathcal{O}(\ln n).
\end{equation}
Note that this approximation (keeping only the first two terms) is asymptotic; it does not mean that the truncated expression converges to $\ln n!$ as $n \to \infty$.
The partial derivative of the Lagrange function with respect to $n_{\tilde{k}}$ is then
$$\dfrac{\partial \mathcal{L}}{\partial n_{\tilde{k}}} = -\ln n_{\tilde{k}} - 1 - \alpha + \beta E_{\tilde{k}}.$$
Setting this partial derivative to zero, we find
$$n_{\tilde{k}} = \dfrac{\exp(\beta E_{\tilde{k}})}{\exp(1+\alpha)}. \qquad (***)$$
If we put this back into $(**)$, we obtain
$$\exp(1+\alpha) = \dfrac{1}{N}\sum_{k=1}^{K}\exp(\beta E_k).$$
Putting this back into $(***)$ gives something that should remind us of the softmax function:
$$n_{\tilde{k}} = \dfrac{\exp(\beta E_{\tilde{k}})}{\dfrac{1}{N}\sum_{k=1}^{K}\exp(\beta E_k)}.$$
If we define $p_{\tilde{k}} = n_{\tilde{k}}/N$ as the probability of class $\mathcal{C}_{\tilde{k}}$, we obtain something that is really similar to the softmax function:
$$p_{\tilde{k}} = \dfrac{\exp(\beta E_{\tilde{k}})}{\sum_{k=1}^{K}\exp(\beta E_k)}.$$
Hence, this shows that the softmax function is the function that maximizes the entropy of the distribution of images over the classes. From this point of view, it makes sense to use it as the class distribution. If we set $\beta E_{\tilde{k}} = \boldsymbol{w}^T_{\tilde{k}}\boldsymbol{x}$, we get exactly the definition of the softmax function for the $\tilde{k}^{\text{th}}$ output.
– MachineLearner, answered Mar 24 at 8:49, edited Mar 24 at 11:10
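As an editorial aside, here is a small numerical check of this derivation (the energies and $\beta$ below are made up for illustration): maximizing the entropy under the two constraints with SciPy recovers the softmax of $\beta E_k$.

```python
import numpy as np
from scipy.optimize import minimize

E = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical class "energies"
beta = 0.7                           # hypothetical inverse temperature

# Closed-form maximum-entropy solution from the derivation: a softmax over beta * E.
p_softmax = np.exp(beta * E) / np.exp(beta * E).sum()
E_bar = p_softmax @ E                # mean energy implied by that distribution

# Numerically maximize the entropy subject to sum(p) = 1 and mean energy = E_bar.
def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},
    {"type": "eq", "fun": lambda p: p @ E - E_bar},
]
res = minimize(neg_entropy, x0=np.full(4, 0.25), bounds=[(0.0, 1.0)] * 4,
               constraints=constraints, method="SLSQP")

print(np.round(res.x, 4))        # numerical maximizer of the entropy
print(np.round(p_softmax, 4))    # softmax(beta * E); should match closely
```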