Probability Calibration: role of hidden layer in Neural Network












I'm trying a simple neural network (logistic regression) to play with Keras.
The input has 5,000 features (the output of a simple tf-idf vectorizer), and in the output layer I use a random uniform initialization, $L_2$ regularization with $\alpha = 0.0001$, and a sigmoid activation function (so something pretty standard).

I use the Adam optimizer with binary cross-entropy as the loss function.

When I plot the predicted probabilities for both classes, I end up with something like this:

[plot: predicted probability distribution for both classes, first model]



After that, I add a hidden layer with a ReLU activation function and 64 nodes, using the same parameters for regularization and initialization.

Edit: the output layer keeps exactly the same parameters (and the sigmoid activation function) as in the previous network.
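For concreteness, here is a minimal Keras sketch of the two models described above (the layer sizes and the $L_2$ constant follow the description; the variable names are placeholders, not my actual code):

    from keras.models import Sequential
    from keras.layers import Dense
    from keras import regularizers

    # First model: logistic regression (sigmoid output directly on the 5,000 tf-idf features).
    model_lr = Sequential([
        Dense(1, input_dim=5000, activation='sigmoid',
              kernel_initializer='random_uniform',
              kernel_regularizer=regularizers.l2(0.0001)),
    ])
    model_lr.compile(optimizer='adam', loss='binary_crossentropy')

    # Second model: identical output layer, with one 64-unit ReLU hidden layer in front of it.
    model_mlp = Sequential([
        Dense(64, input_dim=5000, activation='relu',
              kernel_initializer='random_uniform',
              kernel_regularizer=regularizers.l2(0.0001)),
        Dense(1, activation='sigmoid',
              kernel_initializer='random_uniform',
              kernel_regularizer=regularizers.l2(0.0001)),
    ])
    model_mlp.compile(optimizer='adam', loss='binary_crossentropy')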



But now, when I plot the probability distribution for both classes, I end up with this:

[plot: predicted probability distribution for both classes, after adding the hidden layer]

I clearly don't understand why adding a hidden layer pushes the probabilities towards 0 or 1.



Do you have any references I could read to understand the math behind this? That would be great!

Moreover, sometimes (with a deeper neural network for another application) I get the same kind of plot as the second one, but the predicted probabilities stay within $[0.2, 0.8]$. The probabilities are pushed towards certain values, but those values are more "centered". I'm not sure that is clear. Here again, I don't understand the phenomenon behind this. How can I debug my neural network architecture to find the cause?

In addition, how can I tweak my ANN to get a "perfect" calibration plot (like on the scikit-learn page: https://scikit-learn.org/stable/modules/calibration.html)?

Thank you in advance for every answer that can enlighten me :)










neural-network keras

asked Nov 19 '18 at 19:03 by BimBimBap, edited Nov 20 '18 at 9:23




















2 Answers

          It makes sense that a network with ReLU activations produces worse probabilities than a sigmoid activation when you think about how these activation functions interact with the unit in the output layer of the network. Sigmoid scales the output to be in the range $[0, 1]$, so the values going into the final sigmoid classification unit will be fairly low. Meanwhile, the ReLU-based network has the potential for very high activation magnitudes, which will skew the estimated probabilities to either end of the spectrum.
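As a quick numeric illustration of this saturation effect (a toy calculation, not part of the original answer):

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    print(sigmoid(2.0))   # ~0.88 — a moderate pre-activation gives a moderate probability
    print(sigmoid(20.0))  # ~0.999999998 — a large ReLU-driven pre-activation saturates the sigmoid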



Having said that, calibration of neural networks is not very well understood in general. A paper published at ICML 2017, "On Calibration of Modern Neural Networks" (https://arxiv.org/pdf/1706.04599.pdf), showed that neural nets are often poorly calibrated, but it doesn't really explain why this is the case. However, the authors did investigate factors that affect calibration, and show that deeper, wider networks can often be less well calibrated than their shallower, narrower counterparts. They also show that applying $L_2$ regularisation can improve calibration.



Given this information, I'm surprised that your (fairly shallow and narrow) network is so poorly calibrated. It might be overfitted to NLL, so I would try applying heavier regularisation or training for fewer epochs. If this doesn't work, try implementing temperature scaling, which is described in the paper I linked above. They provide a PyTorch implementation here, so hopefully it shouldn't be too difficult to translate into Keras.
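For reference, a minimal sketch of temperature scaling, assuming you can extract the pre-sigmoid logits from the network on a held-out validation set (val_logits, val_labels and test_logits below are placeholders; this is not the implementation from the paper):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def val_nll(temperature, logits, labels):
        # Binary cross-entropy of the temperature-scaled probabilities.
        p = np.clip(sigmoid(logits / temperature), 1e-12, 1 - 1e-12)
        return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

    # Fit a single scalar T > 0 on the validation set only (the network weights stay fixed).
    result = minimize_scalar(val_nll, bounds=(0.05, 20.0),
                             args=(val_logits, val_labels), method='bounded')
    T = result.x

    # At prediction time, divide the logits by T before applying the sigmoid.
    calibrated_test_probs = sigmoid(test_logits / T)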






answered Nov 19 '18 at 22:33 by timleathart

• Thanks for your answer. I don't get the point about the ReLU activation, since my output layer always uses a sigmoid function. But it is interesting that you too are surprised by the result (it is not a deep NN), and I think (I will check again) that there is no overfitting. I will look at the paper (thanks for the reference) and at temperature scaling. – BimBimBap, Nov 20 '18 at 9:08











• I have clarified my first post regarding the output layer (it is always the same; the only change is the addition of a hidden layer). – BimBimBap, Nov 20 '18 at 9:24



















I spent way too long attempting to calibrate my Keras probability outputs. It turned out to be very simple.

The isotonic model does a great job (so much so that it risks overfitting):



from sklearn.isotonic import IsotonicRegression

# results.pred holds the raw network probabilities, results.act the observed 0/1 outcomes.
ir = IsotonicRegression()
ir.fit(results.pred, results.act)
results['iso'] = ir.predict(results.pred)  # calibrated probabilities


This blog post has more details and some useful code for calibration plots. And the best thing is that the sum of my probabilities adds up to the same as the number of cases.



Obviously you should validate this by training the ir model on your training set and then applying it to your network's predictions on the test data. I've been very happy with the performance as measured by Brier scores and AUC.
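A minimal sketch of that validation step, assuming y_test and the raw test-set probabilities test_pred are available (names are placeholders; the ir model comes from the snippet above):

    from sklearn.metrics import brier_score_loss, roc_auc_score
    from sklearn.calibration import calibration_curve

    # If test_pred can fall outside the range seen during ir.fit, consider
    # fitting with IsotonicRegression(out_of_bounds='clip') instead.
    calibrated = ir.predict(test_pred)

    print("Brier score, raw:      ", brier_score_loss(y_test, test_pred))
    print("Brier score, isotonic: ", brier_score_loss(y_test, calibrated))
    print("AUC, isotonic:         ", roc_auc_score(y_test, calibrated))

    # Reliability-diagram data: fraction of positives vs. mean predicted probability per bin.
    frac_pos, mean_pred = calibration_curve(y_test, calibrated, n_bins=10)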



You'll also see references to Platt scaling; as far as I can tell, isotonic regression is superior on large datasets.






answered Mar 24 at 8:25 by ErichBSchulz (edited Mar 24 at 8:31)