How to select between models when AUC scores are similar?


I use two machine learning algorithms for binary classification and get these results:

Algo 1:

 AUC train: 0.75, AUC test: 0.65 (high train score, overfitting)

Algo 2:

 AUC train: 0.72, AUC test: 0.65 (lower train score, slight overfitting)

Which one is better?

asked 2 days ago by amal amal, edited 2 days ago by Esmailian


3 Answers

Based on the AUC score they are the same. It does not really matter if the model is overfitting or not. What matters is how well it performs on new data (test score).

Overfitting is just an indication that there might be room for improvement by making your model more general. But until the test score has increased, the model has not improved, even if it is overfitting less.
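To make "the same on the test set" a bit more concrete, here is a minimal sketch of a paired bootstrap on the test rows, which gives a rough interval for the difference in test AUC between two models. The helper names (`auc`, `bootstrap_auc_diff`), the number of resamples, and the toy labels and scores are all assumptions made for illustration; they are not from the question.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_diff(labels, scores_a, scores_b, n_boot=1000, seed=0):
    """Rough 95% percentile interval for AUC(a) - AUC(b).

    The same resampled test rows are used for both models (paired bootstrap).
    """
    if len(set(labels)) < 2:
        raise ValueError("labels must contain both classes")
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # skip resamples that contain only one class
        a = auc(y, [scores_a[i] for i in idx])
        b = auc(y, [scores_b[i] for i in idx])
        diffs.append(a - b)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]


# Hypothetical test labels and predicted scores from two models.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
scores_a = [0.10, 0.30, 0.35, 0.60, 0.70, 0.20, 0.55, 0.65, 0.80, 0.90]
scores_b = [0.20, 0.25, 0.50, 0.55, 0.75, 0.30, 0.45, 0.60, 0.85, 0.95]

print(auc(labels, scores_a), auc(labels, scores_b))
print(bootstrap_auc_diff(labels, scores_a, scores_b))
```

If the interval comfortably contains zero, the test set gives no evidence that either model is better, which is the situation described in the question.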

answered 2 days ago by Simon Larsson (edited 2 days ago)

Thanks Simon. So, if I understand correctly, I should always pick the model with the highest test score, without giving any weight to the training score?
– amal amal
2 days ago

Yes, that is correct.
– Simon Larsson
2 days ago

Thanks for your help.
– amal amal
2 days ago

No problem! Don't forget to mark my answer as correct if you got what you asked for.
– Simon Larsson
2 days ago

Algo 2

Between models with equal test scores, choose the one with the smaller difference between training and test scores (Algo 2), since the one with the better training score (Algo 1) is more over-fitted. We tolerate a more over-fitted model only if it has a subjectively better test score.

For a better justification, think of how we train a neural network. When the validation score stops improving, we stop the training process, even though the training score would keep improving. If we let training continue, the model starts making extra assumptions based on the training set that are not scrutinized by the critic (the validation set), which makes the model more prone to building false assumptions about the data.
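The stopping rule described above can be sketched as follows. This is only an illustration: the function name `early_stopping_epoch`, the `patience` parameter, and the per-epoch scores are hypothetical, and real frameworks implement the same idea with more options.

```python
def early_stopping_epoch(val_scores, patience=2):
    """Return the index of the epoch whose weights would be kept.

    Training stops once the validation score has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch, best_score, stale = 0, float("-inf"), 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_epoch, best_score, stale = epoch, score, 0
        else:
            stale += 1
            if stale >= patience:
                break  # the critic (validation set) no longer confirms progress
    return best_epoch


# Hypothetical per-epoch scores: training keeps improving, validation plateaus.
train_auc = [0.60, 0.66, 0.70, 0.72, 0.74, 0.75]
val_auc = [0.58, 0.62, 0.64, 0.65, 0.65, 0.64]

kept = early_stopping_epoch(val_auc)
print(kept, train_auc[kept], val_auc[kept])  # 3 0.72 0.65
```

In this made-up run, the later epochs raise the training score to 0.75 without any gain on the validation set, which is the extra, unscrutinized fitting the paragraph warns about.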

By the same token, a model (Algo 1) that has the same performance according to the critic (the test set) but performs better on the training set is more likely to have made untested assumptions about the data.
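As a rough sketch, the selection rule in this answer could be written like this. The function name `pick_model` and the `tolerance` used to decide when two test scores count as equal are assumptions for illustration; the scores are the ones from the question.

```python
def pick_model(results, tolerance=0.005):
    """Pick a model name from {name: (train_score, test_score)}.

    Models whose test score is within `tolerance` of the best test score
    are treated as tied; among those, the smallest train/test gap wins.
    """
    best_test = max(test for _, test in results.values())
    tied = {
        name: (train, test)
        for name, (train, test) in results.items()
        if best_test - test <= tolerance
    }
    return min(tied, key=lambda name: tied[name][0] - tied[name][1])


results = {"Algo 1": (0.75, 0.65), "Algo 2": (0.72, 0.65)}
print(pick_model(results))  # Algo 2
```

A model with a clearly better test score is still preferred even if its gap is larger, because it is the only one left after the tolerance filter.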

answered 2 days ago by Esmailian (edited 2 days ago)

How can you make these assumptions? The test score tells you the generalization ability of the algorithm regardless of the bias/variance. I feel like you can say nothing about which one will perform better on another test set.
– Simon Larsson
2 days ago

Genuinely curious btw, in case you know something I have missed. :)
– Simon Larsson
2 days ago

@SimonLarsson cool! I made some updates.
– Esmailian
2 days ago

Thank you for replying! But what I would like to know is how you can assume that one will generalize better than the other on other data when the test score is the same. Just because you know that one model has learned some junk from the training set, it does not follow that the other model has learned something useful in its place.
– Simon Larsson
2 days ago

@SimonLarsson I think fundamentally it's an Occam's Razor thing, with an assumption that the more-overfit model is "more complicated." In specific situations it's easier; e.g., if the data is time-dependent and the test set is out-of-time, then the train/test score discrepancy might indicate degradation over time, so that future performance may degrade faster in the more-overfit model.
– Ben Reiniger
2 days ago

Based on this metric alone, you cannot tell which one is better, because AUC does not differentiate between these two results. You should use other metrics, such as Kappa, or some benchmarks.

Disclaimer: I am one of the authors of the PyCM package.

If you are using Python, I suggest the PyCM module, which takes your confusion matrix as input and calculates about 100 overall and class-based metrics.

To use this module, first prepare your confusion matrix and see its recommended metrics with the following code:

>>> from pycm import ConfusionMatrix

>>> cm = ConfusionMatrix(matrix={"0": {"0": 1, "1": 0, "2": 0}, "1": {"0": 0, "1": 1, "2": 2}, "2": {"0": 0, "1": 1, "2": 0}})

>>> print(cm.recommended_list)
["Kappa", "SOA1(Landis & Koch)", "SOA2(Fleiss)", "SOA3(Altman)", "SOA4(Cicchetti)", "CEN", "MCEN", "MCC", "J", "Overall J", "Overall MCC", "Overall CEN", "Overall MCEN", "AUC", "AUCI", "G", "DP", "DPI", "GI"]

Then print the metric values, focusing on the recommended ones, with the following code:

>>> print(cm)
Predict  0  1  2
Actual
0        1  0  0
1        0  1  2
2        0  1  0

Overall Statistics : 

95% CI (-0.02941,0.82941)
Bennett_S 0.1
Chi-Squared 6.66667
Chi-Squared DF 4
Conditional Entropy 0.55098
Cramer_V 0.8165
Cross Entropy 1.52193
Gwet_AC1 0.13043
Joint Entropy 1.92193
KL Divergence 0.15098
Kappa 0.0625
Kappa 95% CI (-0.60846,0.73346)
Kappa No Prevalence -0.2
Kappa Standard Error 0.34233
Kappa Unbiased 0.03226
Lambda A 0.5
Lambda B 0.66667
Mutual Information 0.97095
Overall_ACC 0.4
Overall_RACC 0.36
Overall_RACCU 0.38
PPV_Macro 0.5
PPV_Micro 0.4
Phi-Squared 1.33333
Reference Entropy 1.37095
Response Entropy 1.52193
Scott_PI 0.03226
Standard Error 0.21909
Strength_Of_Agreement(Altman) Poor
Strength_Of_Agreement(Cicchetti) Poor
Strength_Of_Agreement(Fleiss) Poor
Strength_Of_Agreement(Landis and Koch) Slight
TPR_Macro 0.44444
TPR_Micro 0.4

Class Statistics :

Classes 0 1 2 
ACC(Accuracy) 1.0 0.4 0.4 
BM(Informedness or bookmaker informedness) 1.0 -0.16667 -0.5 
DOR(Diagnostic odds ratio) None 0.5 0.0 
ERR(Error rate) 0.0 0.6 0.6 
F0.5(F0.5 score) 1.0 0.45455 0.0 
F1(F1 score - harmonic mean of precision and sensitivity) 1.0 0.4 0.0 
F2(F2 score) 1.0 0.35714 0.0 
FDR(False discovery rate) 0.0 0.5 1.0 
FN(False negative/miss/type 2 error) 0 2 1 
FNR(Miss rate or false negative rate) 0.0 0.66667 1.0 
FOR(False omission rate) 0.0 0.66667 0.33333 
FP(False positive/type 1 error/false alarm) 0 1 2 
FPR(Fall-out or false positive rate) 0.0 0.5 0.5 
G(G-measure geometric mean of precision and sensitivity) 1.0 0.40825 0.0 
LR+(Positive likelihood ratio) None 0.66667 0.0 
LR-(Negative likelihood ratio) 0.0 1.33333 2.0 
MCC(Matthews correlation coefficient) 1.0 -0.16667 -0.40825 
MK(Markedness) 1.0 -0.16667 -0.33333 
N(Condition negative) 4 2 4 
NPV(Negative predictive value) 1.0 0.33333 0.66667 
P(Condition positive) 1 3 1 
POP(Population) 5 5 5 
PPV(Precision or positive predictive value) 1.0 0.5 0.0 
PRE(Prevalence) 0.2 0.6 0.2 
RACC(Random accuracy) 0.04 0.24 0.08 
RACCU(Random accuracy unbiased) 0.04 0.25 0.09 
TN(True negative/correct rejection) 4 1 2 
TNR(Specificity or true negative rate) 1.0 0.5 0.5 
TON(Test outcome negative) 4 3 3 
TOP(Test outcome positive) 1 2 2 
TP(True positive/hit) 1 1 0 
TPR(Sensitivity, recall, hit rate, or true positive rate) 1.0 0.33333 0.0
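As a cross-check on the Kappa value in the output above, Cohen's kappa can be computed directly from the same confusion matrix without any library. This is a minimal sketch: the function name `cohen_kappa` is made up for illustration and is not part of PyCM, and the matrix is written as a list of rows (actual classes) by columns (predicted classes).

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(matrix[i][i] for i in range(k)) / n
    # Chance agreement: product of matching row and column totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)


matrix = [
    [1, 0, 0],
    [0, 1, 2],
    [0, 1, 0],
]
print(round(cohen_kappa(matrix), 4))  # 0.0625
```

Here the observed agreement is 0.4 and the chance agreement is 0.36, matching the Overall_ACC and Overall_RACC rows in the printed statistics.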

answered 2 days ago by Alireza Zolanvari (edited 2 days ago)

You should mention that you are an author of the package. (datascience.stackexchange.com/help/behavior)
– Ben Reiniger
2 days ago

Thanks for the reminder. I just edited my answer.
– Alireza Zolanvari
2 days ago

@alirezazolanvari In my opinion, changing the measure does not solve the underlying problem. First, the choice of measure depends on the task too; we cannot pick and choose independently. More importantly, this problem can happen with any other measure (e.g. Kappa) too, so the solution is not simply to change the measure.
– Esmailian
2 days ago

@Esmailian Obviously the evaluation metric is directly related to the task, but research on finding proper metrics for evaluating a learning algorithm has focused on clarifying the difference in performance between algorithms in cases where simple metrics such as AUC cannot say which one is better. Overall, many other things should be considered to answer this question. This answer is not a golden key for this problem, but it can help to solve it.
– Alireza Zolanvari
2 days ago


3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

Based on the AUC score they are the same. It does not really matter if the model is overfitting or not. What matters is how well it performs on new data (test score).

edited 2 days ago

answered 2 days ago

Simon Larsson

4316

$begingroup$
Thanks simon, so if I understand I should always take the biggest test score as the best model without getting any importance to training score?
$endgroup$
– amal amal
2 days ago

$begingroup$
Yes, that is correct.
$endgroup$
– Simon Larsson
2 days ago

$begingroup$
Thanks for your help
$endgroup$
– amal amal
2 days ago

$begingroup$
No problem! Don't forget to mark my answer as correct if you got what you asked for.
$endgroup$
– Simon Larsson
2 days ago

add a comment |

Based on the AUC score they are the same. It does not really matter if the model is overfitting or not. What matters is how well it performs on new data (test score).

edited 2 days ago

answered 2 days ago

Simon Larsson

4316

$begingroup$
Thanks simon, so if I understand I should always take the biggest test score as the best model without getting any importance to training score?
$endgroup$
– amal amal
2 days ago

$begingroup$
Yes, that is correct.
$endgroup$
– Simon Larsson
2 days ago

$begingroup$
Thanks for your help
$endgroup$
– amal amal
2 days ago

$begingroup$
No problem! Don't forget to mark my answer as correct if you got what you asked for.
$endgroup$
– Simon Larsson
2 days ago

add a comment |

Based on the AUC score they are the same. It does not really matter if the model is overfitting or not. What matters is how well it performs on new data (test score).

edited 2 days ago

answered 2 days ago

Simon Larsson

4316

Based on the AUC score they are the same. It does not really matter if the model is overfitting or not. What matters is how well it performs on new data (test score).

edited 2 days ago

answered 2 days ago

Simon Larsson

4316

edited 2 days ago

answered 2 days ago

Simon Larsson

4316

answered 2 days ago

Simon Larsson

4316

answered 2 days ago

Simon Larsson

4316

$begingroup$
Thanks simon, so if I understand I should always take the biggest test score as the best model without getting any importance to training score?
$endgroup$
– amal amal
2 days ago

$begingroup$
Yes, that is correct.
$endgroup$
– Simon Larsson
2 days ago

$begingroup$
Thanks for your help
$endgroup$
– amal amal
2 days ago

$begingroup$
No problem! Don't forget to mark my answer as correct if you got what you asked for.
$endgroup$
– Simon Larsson
2 days ago

add a comment |

$begingroup$
Thanks simon, so if I understand I should always take the biggest test score as the best model without getting any importance to training score?
$endgroup$
– amal amal
2 days ago

$begingroup$
Yes, that is correct.
$endgroup$
– Simon Larsson
2 days ago

$begingroup$
Thanks for your help
$endgroup$
– amal amal
2 days ago

$begingroup$
No problem! Don't forget to mark my answer as correct if you got what you asked for.
$endgroup$
– Simon Larsson
2 days ago

Thanks simon, so if I understand I should always take the biggest test score as the best model without getting any importance to training score?

– amal amal
2 days ago

Yes, that is correct.

– Simon Larsson
2 days ago

Thanks for your help

– amal amal
2 days ago

No problem! Don't forget to mark my answer as correct if you got what you asked for.

– Simon Larsson
2 days ago

add a comment |

Algo 2

By the same token, a model (Algo 1) that has the same performance based on the critic (test set) but performs better on training set is prone to have made untested assumptions about the data.

edited 2 days ago

answered 2 days ago

Esmailian

1,346113

$begingroup$
How can you make these assumptions? Test score tells you the generalization ability of the algorithm regardless of the bias/variance. I feel like you can say nothing about which one will perform better on another test set.
$endgroup$
– Simon Larsson
2 days ago

$begingroup$
Genuinely curious btw, incase you know something I have missed. :)
$endgroup$
– Simon Larsson
2 days ago

$begingroup$
@SimonLarsson cool! I made some updates.
$endgroup$
– Esmailian
2 days ago

$begingroup$
Thank you for replying! But what I would like to know is how you can assume that one will generalize better than the other on other data when the test score is the same? Just because you know that one model has learned some junk from the training set it does not say that the other model will have learned something useful in its' place.
$endgroup$
– Simon Larsson
2 days ago

2

$begingroup$
@SimonLarsson I think fundamentally it's an Occam's Razor thing, with an assumption that the more-overfit model is "more complicated." In specific situations it's easier; e.g., if the data is time-dependent and the test set is out-of-time, then the train/test score discrepancy might indicate degradation over time, so that future performance may degrade faster in the more-overfit model.
$endgroup$
– Ben Reiniger
2 days ago

|
show 4 more comments

Algo 2

By the same token, a model (Algo 1) that has the same performance based on the critic (test set) but performs better on training set is prone to have made untested assumptions about the data.

edited 2 days ago

answered 2 days ago

Esmailian

1,346113

$begingroup$
How can you make these assumptions? Test score tells you the generalization ability of the algorithm regardless of the bias/variance. I feel like you can say nothing about which one will perform better on another test set.
$endgroup$
– Simon Larsson
2 days ago

$begingroup$
Genuinely curious btw, incase you know something I have missed. :)
$endgroup$
– Simon Larsson
2 days ago

$begingroup$
@SimonLarsson cool! I made some updates.
$endgroup$
– Esmailian
2 days ago

$begingroup$
Thank you for replying! But what I would like to know is how you can assume that one will generalize better than the other on other data when the test score is the same? Just because you know that one model has learned some junk from the training set it does not say that the other model will have learned something useful in its' place.
$endgroup$
– Simon Larsson
2 days ago

2

$begingroup$
@SimonLarsson I think fundamentally it's an Occam's Razor thing, with an assumption that the more-overfit model is "more complicated." In specific situations it's easier; e.g., if the data is time-dependent and the test set is out-of-time, then the train/test score discrepancy might indicate degradation over time, so that future performance may degrade faster in the more-overfit model.
$endgroup$
– Ben Reiniger
2 days ago

|
show 4 more comments

Algo 2

By the same token, a model (Algo 1) that has the same performance based on the critic (test set) but performs better on training set is prone to have made untested assumptions about the data.

edited 2 days ago

answered 2 days ago

Esmailian

1,346113

Algo 2

By the same token, a model (Algo 1) that has the same performance based on the critic (test set) but performs better on training set is prone to have made untested assumptions about the data.

edited 2 days ago

answered 2 days ago

Esmailian

1,346113

edited 2 days ago

answered 2 days ago

Esmailian

1,346113

answered 2 days ago

Esmailian

1,346113

answered 2 days ago

Esmailian

1,346113

$begingroup$
How can you make these assumptions? Test score tells you the generalization ability of the algorithm regardless of the bias/variance. I feel like you can say nothing about which one will perform better on another test set.
$endgroup$
– Simon Larsson
2 days ago

$begingroup$
Genuinely curious btw, incase you know something I have missed. :)
$endgroup$
– Simon Larsson
2 days ago

$begingroup$
@SimonLarsson cool! I made some updates.
$endgroup$
– Esmailian
2 days ago

$begingroup$
Thank you for replying! But what I would like to know is how you can assume that one will generalize better than the other on other data when the test score is the same? Just because you know that one model has learned some junk from the training set it does not say that the other model will have learned something useful in its' place.
$endgroup$
– Simon Larsson
2 days ago

2

$begingroup$
@SimonLarsson I think fundamentally it's an Occam's Razor thing, with an assumption that the more-overfit model is "more complicated." In specific situations it's easier; e.g., if the data is time-dependent and the test set is out-of-time, then the train/test score discrepancy might indicate degradation over time, so that future performance may degrade faster in the more-overfit model.
$endgroup$
– Ben Reiniger
2 days ago

|
show 4 more comments

$begingroup$
How can you make these assumptions? Test score tells you the generalization ability of the algorithm regardless of the bias/variance. I feel like you can say nothing about which one will perform better on another test set.
$endgroup$
– Simon Larsson
2 days ago

$begingroup$
Genuinely curious btw, incase you know something I have missed. :)
$endgroup$
– Simon Larsson
2 days ago

$begingroup$
@SimonLarsson cool! I made some updates.
$endgroup$
– Esmailian
2 days ago

$begingroup$
Thank you for replying! But what I would like to know is how you can assume that one will generalize better than the other on other data when the test score is the same? Just because you know that one model has learned some junk from the training set it does not say that the other model will have learned something useful in its' place.
$endgroup$
– Simon Larsson
2 days ago

2

$begingroup$
@SimonLarsson I think fundamentally it's an Occam's Razor thing, with an assumption that the more-overfit model is "more complicated." In specific situations it's easier; e.g., if the data is time-dependent and the test set is out-of-time, then the train/test score discrepancy might indicate degradation over time, so that future performance may degrade faster in the more-overfit model.
$endgroup$
– Ben Reiniger
2 days ago

How can you make these assumptions? Test score tells you the generalization ability of the algorithm regardless of the bias/variance. I feel like you can say nothing about which one will perform better on another test set.

– Simon Larsson
2 days ago

Genuinely curious btw, incase you know something I have missed. :)

– Simon Larsson
2 days ago

@SimonLarsson cool! I made some updates.

– Esmailian
2 days ago

Thank you for replying! But what I would like to know is how you can assume that one will generalize better than the other on other data when the test score is the same? Just because you know that one model has learned some junk from the training set it does not say that the other model will have learned something useful in its' place.

– Simon Larsson
2 days ago

@SimonLarsson I think fundamentally it's an Occam's Razor thing, with an assumption that the more-overfit model is "more complicated." In specific situations it's easier; e.g., if the data is time-dependent and the test set is out-of-time, then the train/test score discrepancy might indicate degradation over time, so that future performance may degrade faster in the more-overfit model.

– Ben Reiniger
2 days ago

|
show 4 more comments

Just based on this metric you can not find which one is better because AUC could not differentiate these two result. You should use some other metrics such as Kappa or some benchmarks.

Disclaimer:

If you are using Python I suggest PyCM module which get your confusion matrix as input and calculate about 100 overall and class-based metrics.

For using this module at first prepare your confusion matrix and see it's recommended parameters by the following code:

>>> from pycm import *

>>> cm = ConfusionMatrix(matrix="0": "0": 1, "1":0, "2": 0, "1": "0": 0, "1": 1, "2": 2, "2": "0": 0, "1": 1, "2": 0) 

>>> print(cm.recommended_list)
["Kappa", "SOA1(Landis & Koch)", "SOA2(Fleiss)", "SOA3(Altman)", "SOA4(Cicchetti)", "CEN", "MCEN", "MCC", "J", "Overall J", "Overall MCC", "Overall CEN", "Overall MCEN", "AUC", "AUCI", "G", "DP", "DPI", "GI"]

and then see the value of the metrics focusing on the recommended metrics by the following code:

>>> print(cm)
 Predict 0 1 2 
 Actual
 0 1 0 0 
 1 0 1 2 
 2 0 1 0 




Overall Statistics : 

95% CI (-0.02941,0.82941)
Bennett_S 0.1
Chi-Squared 6.66667
Chi-Squared DF 4
Conditional Entropy 0.55098
Cramer_V 0.8165
Cross Entropy 1.52193
Gwet_AC1 0.13043
Joint Entropy 1.92193
KL Divergence 0.15098
Kappa 0.0625
Kappa 95% CI (-0.60846,0.73346)
Kappa No Prevalence -0.2
Kappa Standard Error 0.34233
Kappa Unbiased 0.03226
Lambda A 0.5
Lambda B 0.66667
Mutual Information 0.97095
Overall_ACC 0.4
Overall_RACC 0.36
Overall_RACCU 0.38
PPV_Macro 0.5
PPV_Micro 0.4
Phi-Squared 1.33333
Reference Entropy 1.37095
Response Entropy 1.52193
Scott_PI 0.03226
Standard Error 0.21909
Strength_Of_Agreement(Altman) Poor
Strength_Of_Agreement(Cicchetti) Poor
Strength_Of_Agreement(Fleiss) Poor
Strength_Of_Agreement(Landis and Koch) Slight
TPR_Macro 0.44444
TPR_Micro 0.4

Class Statistics :

Classes 0 1 2 
ACC(Accuracy) 1.0 0.4 0.4 
BM(Informedness or bookmaker informedness) 1.0 -0.16667 -0.5 
DOR(Diagnostic odds ratio) None 0.5 0.0 
ERR(Error rate) 0.0 0.6 0.6 
F0.5(F0.5 score) 1.0 0.45455 0.0 
F1(F1 score - harmonic mean of precision and sensitivity) 1.0 0.4 0.0 
F2(F2 score) 1.0 0.35714 0.0 
FDR(False discovery rate) 0.0 0.5 1.0 
FN(False negative/miss/type 2 error) 0 2 1 
FNR(Miss rate or false negative rate) 0.0 0.66667 1.0 
FOR(False omission rate) 0.0 0.66667 0.33333 
FP(False positive/type 1 error/false alarm) 0 1 2 
FPR(Fall-out or false positive rate) 0.0 0.5 0.5 
G(G-measure geometric mean of precision and sensitivity) 1.0 0.40825 0.0 
LR+(Positive likelihood ratio) None 0.66667 0.0 
LR-(Negative likelihood ratio) 0.0 1.33333 2.0 
MCC(Matthews correlation coefficient) 1.0 -0.16667 -0.40825 
MK(Markedness) 1.0 -0.16667 -0.33333 
N(Condition negative) 4 2 4 
NPV(Negative predictive value) 1.0 0.33333 0.66667 
P(Condition positive) 1 3 1 
POP(Population) 5 5 5 
PPV(Precision or positive predictive value) 1.0 0.5 0.0 
PRE(Prevalence) 0.2 0.6 0.2 
RACC(Random accuracy) 0.04 0.24 0.08 
RACCU(Random accuracy unbiased) 0.04 0.25 0.09 
TN(True negative/correct rejection) 4 1 2 
TNR(Specificity or true negative rate) 1.0 0.5 0.5 
TON(Test outcome negative) 4 3 3 
TOP(Test outcome positive) 1 2 2 
TP(True positive/hit) 1 1 0 
TPR(Sensitivity, recall, hit rate, or true positive rate) 1.0 0.33333 0.0

edited 2 days ago

answered 2 days ago

Alireza Zolanvari

19114

1

$begingroup$
You should mention that you are an author of the package. (datascience.stackexchange.com/help/behavior)
$endgroup$
– Ben Reiniger
2 days ago

$begingroup$
thanks for your reminder.I just edited my answer
$endgroup$
– Alireza Zolanvari
2 days ago

$begingroup$
@alirezazolanvari In my opinion, change of measure does not solve the underlying problem. First, choice of measure dependents on task too, we cannot peak and choose independently. More importantly, this problem can happen for any other measure (e.g. Kappa) too, the solution is not to simply change the measure.
$endgroup$
– Esmailian
2 days ago

$begingroup$
@Esmailian obviously the evaluation metric is directly related to the task but the researches for finding proper metrics for evaluating a learning algorithm have been focused on clearing the difference between the performance of algorithms in the cases in which the simple metrics such as AUC can not say which one is better. Totally for answering this question many other things should be considered. This answer not a golden key for this problem but can be helpful to solve it.
$endgroup$
– Alireza Zolanvari
2 days ago

add a comment |

Just based on this metric you can not find which one is better because AUC could not differentiate these two result. You should use some other metrics such as Kappa or some benchmarks.

Disclaimer:

If you are using Python I suggest PyCM module which get your confusion matrix as input and calculate about 100 overall and class-based metrics.

For using this module at first prepare your confusion matrix and see it's recommended parameters by the following code:

>>> from pycm import *

>>> cm = ConfusionMatrix(matrix="0": "0": 1, "1":0, "2": 0, "1": "0": 0, "1": 1, "2": 2, "2": "0": 0, "1": 1, "2": 0) 

>>> print(cm.recommended_list)
["Kappa", "SOA1(Landis & Koch)", "SOA2(Fleiss)", "SOA3(Altman)", "SOA4(Cicchetti)", "CEN", "MCEN", "MCC", "J", "Overall J", "Overall MCC", "Overall CEN", "Overall MCEN", "AUC", "AUCI", "G", "DP", "DPI", "GI"]

and then see the value of the metrics focusing on the recommended metrics by the following code:

>>> print(cm)
 Predict 0 1 2 
 Actual
 0 1 0 0 
 1 0 1 2 
 2 0 1 0 




Overall Statistics : 

95% CI (-0.02941,0.82941)
Bennett_S 0.1
Chi-Squared 6.66667
Chi-Squared DF 4
Conditional Entropy 0.55098
Cramer_V 0.8165
Cross Entropy 1.52193
Gwet_AC1 0.13043
Joint Entropy 1.92193
KL Divergence 0.15098
Kappa 0.0625
Kappa 95% CI (-0.60846,0.73346)
Kappa No Prevalence -0.2
Kappa Standard Error 0.34233
Kappa Unbiased 0.03226
Lambda A 0.5
Lambda B 0.66667
Mutual Information 0.97095
Overall_ACC 0.4
Overall_RACC 0.36
Overall_RACCU 0.38
PPV_Macro 0.5
PPV_Micro 0.4
Phi-Squared 1.33333
Reference Entropy 1.37095
Response Entropy 1.52193
Scott_PI 0.03226
Standard Error 0.21909
Strength_Of_Agreement(Altman) Poor
Strength_Of_Agreement(Cicchetti) Poor
Strength_Of_Agreement(Fleiss) Poor
Strength_Of_Agreement(Landis and Koch) Slight
TPR_Macro 0.44444
TPR_Micro 0.4

Class Statistics :

Classes 0 1 2 
ACC(Accuracy) 1.0 0.4 0.4 
BM(Informedness or bookmaker informedness) 1.0 -0.16667 -0.5 
DOR(Diagnostic odds ratio) None 0.5 0.0 
ERR(Error rate) 0.0 0.6 0.6 
F0.5(F0.5 score) 1.0 0.45455 0.0 
F1(F1 score - harmonic mean of precision and sensitivity) 1.0 0.4 0.0 
F2(F2 score) 1.0 0.35714 0.0 
FDR(False discovery rate) 0.0 0.5 1.0 
FN(False negative/miss/type 2 error) 0 2 1 
FNR(Miss rate or false negative rate) 0.0 0.66667 1.0 
FOR(False omission rate) 0.0 0.66667 0.33333 
FP(False positive/type 1 error/false alarm) 0 1 2 
FPR(Fall-out or false positive rate) 0.0 0.5 0.5 
G(G-measure geometric mean of precision and sensitivity) 1.0 0.40825 0.0 
LR+(Positive likelihood ratio) None 0.66667 0.0 
LR-(Negative likelihood ratio) 0.0 1.33333 2.0 
MCC(Matthews correlation coefficient) 1.0 -0.16667 -0.40825 
MK(Markedness) 1.0 -0.16667 -0.33333 
N(Condition negative) 4 2 4 
NPV(Negative predictive value) 1.0 0.33333 0.66667 
P(Condition positive) 1 3 1 
POP(Population) 5 5 5 
PPV(Precision or positive predictive value) 1.0 0.5 0.0 
PRE(Prevalence) 0.2 0.6 0.2 
RACC(Random accuracy) 0.04 0.24 0.08 
RACCU(Random accuracy unbiased) 0.04 0.25 0.09 
TN(True negative/correct rejection) 4 1 2 
TNR(Specificity or true negative rate) 1.0 0.5 0.5 
TON(Test outcome negative) 4 3 3 
TOP(Test outcome positive) 1 2 2 
TP(True positive/hit) 1 1 0 
TPR(Sensitivity, recall, hit rate, or true positive rate) 1.0 0.33333 0.0
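
As a sanity check on the report above, a few of its numbers can be reproduced by hand. The following is a minimal sketch in plain Python, not PyCM's own implementation; the helper names `overall_kappa`, `one_vs_rest`, and `mcc` are made up for this illustration. It recomputes Overall_ACC, Overall_RACC, and Kappa, plus the per-class TP/FP/FN/TN counts and MCC, from the same confusion matrix.

```python
from math import sqrt

# Confusion matrix from the example above: matrix[actual][predicted] = count.
matrix = {
    "0": {"0": 1, "1": 0, "2": 0},
    "1": {"0": 0, "1": 1, "2": 2},
    "2": {"0": 0, "1": 1, "2": 0},
}


def overall_kappa(matrix):
    """Return (observed accuracy, chance accuracy, Cohen's kappa)."""
    classes = list(matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    observed = sum(matrix[c][c] for c in classes) / total
    # Chance agreement: for each class, P(actual = c) * P(predicted = c).
    chance = sum(
        (sum(matrix[c].values()) / total)
        * (sum(matrix[a][c] for a in classes) / total)
        for c in classes
    )
    return observed, chance, (observed - chance) / (1 - chance)


def one_vs_rest(matrix, cls):
    """Return (TP, FP, FN, TN) for one class treated as the positive class."""
    classes = list(matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    tp = matrix[cls][cls]
    fn = sum(matrix[cls].values()) - tp
    fp = sum(matrix[a][cls] for a in classes) - tp
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; None when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else None


observed, chance, kappa = overall_kappa(matrix)
print("Overall_ACC ", round(observed, 5))
print("Overall_RACC", round(chance, 5))
print("Kappa       ", round(kappa, 5))
for cls in matrix:
    counts = one_vs_rest(matrix, cls)
    print(cls, "TP/FP/FN/TN =", counts, "MCC =", round(mcc(*counts), 5))
```

Kappa here is just (Overall_ACC - Overall_RACC) / (1 - Overall_RACC), which is why a model with 0.4 accuracy scores so close to zero: chance agreement alone already gives 0.36.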

edited 2 days ago

answered 2 days ago

Alireza Zolanvari

19114

1

You should mention that you are an author of the package. (datascience.stackexchange.com/help/behavior)
– Ben Reiniger
2 days ago

Thanks for the reminder. I just edited my answer.
– Alireza Zolanvari
2 days ago

@alirezazolanvari In my opinion, changing the measure does not solve the underlying problem. First, the choice of measure depends on the task too; we cannot pick and choose independently. More importantly, this problem can happen with any other measure (e.g. Kappa) too, so the solution is not simply to change the measure.
– Esmailian
2 days ago

@Esmailian Obviously the evaluation metric is directly related to the task, but research on finding proper metrics for evaluating a learning algorithm has focused on clarifying the difference in performance between algorithms in cases where simple metrics such as AUC cannot say which one is better. Overall, many other things should be considered to answer this question. This answer is not a golden key for this problem, but it can be helpful in solving it.
– Alireza Zolanvari
2 days ago
