
Probability Calibration: role of hidden layer in Neural Network



I am trying a simple neural network (essentially a logistic regression) to play with Keras.
The input has 5,000 features (the output of a simple tf-idf vectorizer), and the output layer uses a random uniform initialization, $L_2$ regularization with $\alpha = 0.0001$, and a sigmoid activation function (so something pretty standard).



I use an Adam optimizer and the loss function is the binary cross-entropy.
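In Keras terms the setup looks roughly like this (a sketch from memory, not my exact code; the initializer, regularizer and loss names are the standard Keras ones):

    # Single-layer model: 5,000 tf-idf features -> one sigmoid output unit,
    # random uniform initialization and L2 regularization (alpha = 0.0001).
    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    model = keras.Sequential([
        layers.Dense(1,
                     input_shape=(5000,),
                     activation="sigmoid",
                     kernel_initializer="random_uniform",
                     kernel_regularizer=regularizers.l2(1e-4)),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")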



When I plot the predicted probabilities for both classes, I end up with something like this:



[image: histogram of predicted probabilities for both classes, single-layer model]



After that, I add a hidden layer of 64 nodes with a ReLU activation function, again with the same parameters for regularization and initialization.

Edit: the output layer keeps exactly the same parameters (and sigmoid activation function) as in the previous network; the only change is the added hidden layer.
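The variant with the hidden layer, sketched the same way (again an approximation, not my exact code):

    # Same imports as above; the only change is a 64-unit ReLU hidden layer
    # in front of the unchanged sigmoid output layer.
    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    model2 = keras.Sequential([
        layers.Dense(64,
                     input_shape=(5000,),
                     activation="relu",
                     kernel_initializer="random_uniform",
                     kernel_regularizer=regularizers.l2(1e-4)),
        layers.Dense(1,
                     activation="sigmoid",
                     kernel_initializer="random_uniform",
                     kernel_regularizer=regularizers.l2(1e-4)),
    ])
    model2.compile(optimizer="adam", loss="binary_crossentropy")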



But now, when I plot the probability distribution for both classes, I end up with this:



[image: predicted probability distribution for both classes with the hidden layer added; the probabilities pile up near 0 and 1]



Clearly I don't understand why adding a hidden layer pushes the probabilities towards 0 or 1.



Do you have any references I could read to understand the math behind this? That would be great!



Moreover, sometimes (with a deeper neural network for another application) I get the same kind of plot as the second one, but this time the predicted probabilities stay within $[0.2, 0.8]$. The probabilities are still pushed towards certain values, but those values are more centered (not sure that is clear). Here again I don't understand the phenomenon behind this. How can I debug my network architecture to see what the cause is?



In addition, how can I tweak my network to get a "perfect" calibration plot (like on the scikit-learn page: https://scikit-learn.org/stable/modules/calibration.html)?



Thank you in advance for any answer that can enlighten me :)










neural-network keras

asked Nov 19 '18 at 19:03 by BimBimBap (edited Nov 20 '18 at 9:23)
2 Answers







It makes sense that a network with ReLU activations produces worse-calibrated probabilities than one with sigmoid activations when you think about how these activation functions interact with the unit in the output layer of the network. A sigmoid keeps each activation in the range $[0, 1]$, so the values going into the final sigmoid classification unit will be fairly low. Meanwhile, the ReLU-based network has the potential for very high activation magnitudes, which will skew the estimated probabilities towards either end of the spectrum.
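To illustrate the saturation effect with a toy example (my own numbers, not anything specific to your network): once the pre-sigmoid activation gets even moderately large, the output is already pinned near 0 or 1.

    # Toy illustration: the sigmoid saturates quickly for large logits,
    # which are easy to reach when unbounded ReLU activations feed the output unit.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for z in [0.5, 2.0, 5.0, 10.0]:
        print(f"{z}: {sigmoid(z):.5f}")  # 0.62246, 0.88080, 0.99331, 0.99995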



Having said that, calibration of neural networks is not very well understood in general. A paper was published last year at ICML (https://arxiv.org/pdf/1706.04599.pdf) which showed that often neural nets are poorly calibrated, but it doesn't really explain why this is the case. However, they did investigate factors that affect calibration, and show that deeper, wider networks can often be less well calibrated than their shallower, narrower counterparts. They also show that applying $L_2$ regularisation can improve calibration.



Given this information, I'm surprised that your (fairly shallow and narrow) network is so poorly calibrated. It might be overfitting to the NLL, so I would try applying heavier regularisation or training for fewer epochs. If this doesn't work, try implementing temperature scaling, which is described in the paper I linked above. They provide a PyTorch implementation, so hopefully it shouldn't be too difficult to translate into Keras.
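For a binary classifier, temperature scaling boils down to fitting one scalar on held-out logits. A rough sketch (my own illustration, not the implementation from the paper; `val_logits`/`val_labels` and `test_logits` are placeholder names for your validation and test pre-sigmoid outputs and targets):

    # Temperature scaling sketch: find T > 0 minimizing the validation NLL,
    # then divide logits by T before the sigmoid at prediction time.
    import numpy as np
    from scipy.optimize import minimize_scalar

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_temperature(val_logits, val_labels):
        def nll(t):
            p = np.clip(sigmoid(val_logits / t), 1e-12, 1 - 1e-12)
            return -np.mean(val_labels * np.log(p) + (1 - val_labels) * np.log(1 - p))
        return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

    # T = fit_temperature(val_logits, val_labels)
    # calibrated = sigmoid(test_logits / T)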






answered Nov 19 '18 at 22:33 by timleathart
• Thanks for your answer. I don't get the point about the ReLU activation, since my output layer is always a sigmoid function. But it is interesting that you too are surprised by the result (it is not a deep NN), and I think (I will check again) that there is no overfitting. I will look at the paper (thanks for the reference) and at the temperature scaling. – BimBimBap, Nov 20 '18 at 9:08











• I clarified my first post regarding the output layer (it is always the same; the only change is the addition of a hidden layer). – BimBimBap, Nov 20 '18 at 9:24



















I spent way too long attempting to calibrate my Keras probability outputs. It turned out to be very simple.

The isotonic model does a great job (so much so that it risks overfitting):



    # `results` is a DataFrame holding the network's predicted probabilities
    # in `pred` and the actual 0/1 outcomes in `act`.
    from sklearn.isotonic import IsotonicRegression

    ir = IsotonicRegression(out_of_bounds='clip')  # clip inputs outside the fitted range instead of returning NaN
    ir.fit(results.pred, results.act)
    results['iso'] = ir.predict(results.pred)


This blog post has more details and some useful code for calibration plots. And the best thing is that the sum of my probabilities adds up to the same value as the number of cases.



Obviously you should validate this by training the isotonic model on your training set and then applying it to your network's predictions on the test data. I've been very happy with the performance as measured by Brier scores and AUC.
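A sketch of that validation step (my own illustration; `train_pred`/`train_act` and `test_pred`/`test_act` are placeholder names for the raw network probabilities and labels on the two splits):

    # Fit the isotonic model on the training split, evaluate on the test split.
    from sklearn.isotonic import IsotonicRegression
    from sklearn.calibration import calibration_curve
    from sklearn.metrics import brier_score_loss, roc_auc_score

    ir = IsotonicRegression(out_of_bounds='clip')
    ir.fit(train_pred, train_act)
    test_iso = ir.predict(test_pred)

    print("Brier raw:", brier_score_loss(test_act, test_pred))
    print("Brier iso:", brier_score_loss(test_act, test_iso))
    print("AUC  iso:", roc_auc_score(test_act, test_iso))

    # Reliability curve: a well-calibrated model hugs the diagonal.
    frac_pos, mean_pred = calibration_curve(test_act, test_iso, n_bins=10)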



You'll see references to Platt scaling as well. As far as I can tell, isotonic regression is superior on large datasets.






answered Mar 24 at 8:25 by ErichBSchulz (edited Mar 24 at 8:31)
