I got 100% accuracy on my test set, is there something wrong?


























I got 100% accuracy on my test set when training a decision tree, but only 85% accuracy with a random forest.

Is there something wrong with my model, or is a decision tree simply best suited to this dataset?

Code:

import sklearn.metrics
from sklearn.model_selection import train_test_split

# x and y are the feature matrix and labels, loaded earlier (not shown)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)

# Random Forest

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=1000, random_state=42)
rf.fit(x_train, y_train)
predictions = rf.predict(x_test)
cm = sklearn.metrics.confusion_matrix(y_test, predictions)
print(cm)

# Decision Tree

from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x_train, y_train)
predictions = clf.predict(x_test)
cm = sklearn.metrics.confusion_matrix(y_test, predictions)
print(cm)

Confusion Matrix:

Random Forest:

[[19937     1]
 [    8    52]]

Decision Tree:

[[19938     0]
 [    0    60]]











  • Can we see the data? What is your train score? – Sam, Jul 19 '18 at 10:55
  • I can't show the data as it is private. I am doing transaction fraud detection; the data is of actual transactions, but about 99% of it belongs to one class. – Harigovind Valsakumar, Jul 20 '18 at 4:08
  • Ah, OK. What metric are you using to evaluate model performance? Given your class imbalance, I'd recommend using ROC AUC. If you cannot provide the data, could you show us a confusion matrix of your results? – Sam, Jul 20 '18 at 5:19
  • Confusion matrix for random forest: [[19937 1] [8 52]]. Confusion matrix for decision tree: [[19938 0] [0 60]]. – Harigovind Valsakumar, Jul 20 '18 at 5:31
  • Hm. Can you show us some code? – Sam, Jul 20 '18 at 5:36
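Since the comments ask which metric is being used, it may help to read accuracy, precision and recall directly off the two confusion matrices posted above. The sketch below assumes the scikit-learn layout (rows are true classes, columns are predicted classes) and assumes the second class is the fraud class; neither is stated explicitly in the question. Under those assumptions the random forest's overall accuracy is above 99.9%, while its recall on the rare class is 52/60, which may be where the "85%" figure comes from.

```python
# Confusion matrices as posted in the question.
# Assumption: rows are true classes, columns are predicted classes,
# and the second class (index 1) is the rare fraud class.
random_forest_cm = [[19937, 1], [8, 52]]
decision_tree_cm = [[19938, 0], [0, 60]]


def summarize(cm):
    """Return accuracy, precision and recall for the positive (second) class."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


for name, cm in [("Random Forest", random_forest_cm), ("Decision Tree", decision_tree_cm)]:
    accuracy, precision, recall = summarize(cm)
    print(f"{name}: accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

With 99% of the rows in one class, accuracy barely moves when fraud cases are missed, so precision and recall on the rare class are the more informative numbers here.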















scikit-learn random-forest decision-trees accuracy machine-learning-model






edited Jul 22 '18 at 14:04 by Stephen Rauch
asked Jul 19 '18 at 8:16 by Harigovind Valsakumar





















6 Answers


















There may be a few reasons this is happening.

  1. First of all, check your code. 100% accuracy seems unlikely in any setting. How many testing data points do you have? How many training data points did you train your model on? You may have made a coding mistake and compared a list with itself.

  2. Did you use a separate test set for testing? The high accuracy may be due to luck. Try k-fold cross-validation, which is widely available in libraries.

  3. You can visualise your decision tree to find out what is happening. If it has 100% accuracy on the test set, does it also have 100% on the training set?
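The cross-validation check suggested in point 2 could look roughly like the sketch below. It uses scikit-learn's `StratifiedKFold` and `cross_validate` on synthetic data from `make_classification`, which only stands in for the private fraud data (an assumption, as are the sample size, the 1% class weight and the choice of recall as the score). Reporting the train score next to the test score also answers the question in point 3.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the private fraud data (assumption): about 1% positives.
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.99, 0.01], random_state=0,
)

# Stratified folds keep the rare class present in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    DecisionTreeClassifier(random_state=0),
    X, y, cv=cv, scoring="recall", return_train_score=True,
)

print("train recall per fold:", results["train_score"])
print("test recall per fold: ", results["test_score"])
```

If the test score stays near perfect across all folds on the real data, luck in a single split becomes a less likely explanation, and leakage or duplicated rows become more likely.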












  • Confusion matrix for random forest: [[19937 1] [8 52]]. Confusion matrix for decision tree: [[19938 0] [0 60]]. – Harigovind Valsakumar, Jul 20 '18 at 4:03
  • Is this for training data or test data? What's the performance on training data? Please read my first 2 points again; did you ensure you are not training on the test data? – c zl, Jul 20 '18 at 5:49
  • This is for the test data. There are about 80k rows in train and 20k rows in test. – Harigovind Valsakumar, Jul 20 '18 at 5:59
  • You are not answering my questions. – c zl, Jul 20 '18 at 6:04
  • On the training data the score is 0.9999499949995; on the test data the accuracy is 100%. – Harigovind Valsakumar, Jul 20 '18 at 6:11



















The default hyper-parameters of the DecisionTreeClassifier allow it to overfit your training data.

The default min_samples_leaf is 1. The default max_depth is None. This combination allows your DecisionTreeClassifier to grow until there is a single data point at each leaf.

Since you are getting 100% accuracy, I would assume you have duplicates in your train and test splits. This has nothing to do with the way you split but with the way you cleaned your data.

Can you check if you have duplicate data points?

x = [[1, 2, 3],
     [4, 5, 6],
     [1, 2, 3]]

y = [1,
     2,
     1]

initial_number_of_data_points = len(x)


def get_unique(X_matrix, y_vector):
    # Pair each row with its label and keep one copy of each (row, label) pair.
    # Note: going through a set does not preserve the original row order.
    Xy = list(set(zip([tuple(row) for row in X_matrix], y_vector)))
    X_matrix = [list(pair[0]) for pair in Xy]
    y_vector = [pair[1] for pair in Xy]
    return X_matrix, y_vector


x, y = get_unique(x, y)
data_points_removed = initial_number_of_data_points - len(x)
print("Number of duplicates removed:", data_points_removed)

If you have duplicates in your train and test splits, high accuracy is to be expected.
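A complementary check, sketched below, is to count how many test rows have feature values that also occur in the training set after the split. The helper name `count_test_rows_seen_in_training` and the tiny demo lists are hypothetical and only illustrate the idea; with array inputs, convert them to lists of rows first (for example with `.tolist()` on a NumPy array).

```python
def count_test_rows_seen_in_training(x_train, x_test):
    """Count rows of x_test whose feature values also occur in x_train."""
    seen = {tuple(row) for row in x_train}
    return sum(1 for row in x_test if tuple(row) in seen)


# Tiny hypothetical split, used only to illustrate the check.
x_train_demo = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
x_test_demo = [[1, 2, 3], [10, 11, 12]]

overlap = count_test_rows_seen_in_training(x_train_demo, x_test_demo)
print(f"{overlap} of {len(x_test_demo)} test rows also appear in the training set")
```

A large overlap on the real data would mean a fully grown tree can answer many test rows from memory, which fits the duplicate hypothesis above.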




















Please check whether you used your test set for building the model. This is a common scenario; see for example:

Random Forest Classifier gives very high accuracy on test set - overfitting?

If that is the case, everything makes sense: the random forest tries not to overfit, while a decision tree simply memorizes your data as a tree.
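The effect described here can be reproduced with a small sketch, under the assumption of purely random features and labels (so there is nothing real to learn). A fully grown tree that was fit on all rows, test rows included, scores perfectly on the "test" set, while the same tree fit on the training rows alone does no better than guessing. The data and variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))        # random features
y = rng.integers(0, 2, size=1000)     # labels unrelated to the features

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Leaky: the tree is fit on ALL rows, including the ones later used for testing.
leaky = DecisionTreeClassifier(random_state=0).fit(X, y)
leaky_score = leaky.score(X_test, y_test)

# Honest: the tree only ever sees the training rows.
honest = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
honest_score = honest.score(X_test, y_test)

print("leaky test accuracy: ", leaky_score)
print("honest test accuracy:", honest_score)
```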




















I agree with c zl. In my experience this doesn't sound like a stable model; it points to a lucky random cut of the data, and to something that will struggle to provide similar performance on unseen data.

The best models have:

  1. high accuracy on train data,

  2. equally high accuracy on test data, and

  3. train and test accuracy within 5-10% of each other, which probably indicates model stability. The smaller the difference the better, I feel.

Bootstrapping and k-fold cross-validation should usually provide more reliable performance numbers.
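As a rough illustration of the bootstrapping idea, the sketch below resamples the 60 fraud outcomes from the random forest confusion matrix in the question (52 detected, 8 missed) and reports a percentile interval for recall. The seed, the number of resamples and the 95% level are arbitrary choices for this example, and resampling test outcomes only captures test-set sampling noise, not variation from retraining the model.

```python
import random

# Outcomes for the 60 fraud cases in the test set, taken from the random forest
# confusion matrix in the question: 52 detected (1) and 8 missed (0).
outcomes = [1] * 52 + [0] * 8

random.seed(0)        # fixed seed so the sketch is repeatable
n_resamples = 2000    # arbitrary choice for this illustration

# Resample the 60 outcomes with replacement and recompute recall each time.
recalls = sorted(
    sum(random.choices(outcomes, k=len(outcomes))) / len(outcomes)
    for _ in range(n_resamples)
)

# Percentile bootstrap: take the 2.5th and 97.5th percentiles.
lower = recalls[int(0.025 * n_resamples)]
upper = recalls[int(0.975 * n_resamples) - 1]
observed = sum(outcomes) / len(outcomes)

print(f"observed recall {observed:.3f}, 95% bootstrap interval [{lower:.3f}, {upper:.3f}]")
```

With only 60 positive cases, the interval is wide, which is one reason a single split can look much better or worse than the model really is.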




















  1. As already mentioned here, decision trees (DT) are easy to overfit with the default parameters, so a random forest (RF) is usually the better choice compared to a DT. Consider RFs as generalizing better.

  2. Why are the accuracies different? An RF considers only a random subset of the features at each split of each tree, whereas a DT considers all of them. So you probably have a small number of features that have a very big influence on the target variable across the whole dataset. Why is that? Identify those features and investigate them closely.

  3. Can we say that DT is more stable than RF for this particular task? I'd say no, because the situation may change. If your DT relies on 1-3 important features, you cannot be sure that these features will play the same significant role in the future.

  4. You can use DT, but keep in mind that you have to retrain the model regularly and compare the results to RF.

So, my advice:

  • compute feature importances (you can use the RF model to get them) and use only the important features;

  • use some kind of KFold or StratifiedKFold here; it will give you more reliable scores;

  • since you have an imbalanced dataset, set class_weight; it should improve the RF score;

  • use GridSearchCV; it will also help you avoid overfitting (and 1000 trees is probably more than you need);

  • always report the train score as well;

  • investigate the important features.
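Several of these suggestions can be combined in one short sketch. The example below assumes synthetic data from `make_classification` in place of the real transactions, and the specific settings (100 trees, `class_weight="balanced"`, 5 folds, ROC AUC as the score) are illustrative choices rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for the real transactions (assumption).
X, y = make_classification(
    n_samples=3000, n_features=8, n_informative=3, n_redundant=0,
    weights=[0.97, 0.03], random_state=0,
)

# class_weight="balanced" reweights classes inversely to their frequency.
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)

# Stratified folds preserve the class ratio in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_scores = cross_val_score(rf, X, y, cv=cv, scoring="roc_auc")
print("ROC AUC per fold:", auc_scores)

# Fit once on all the data to inspect which features the forest relies on.
rf.fit(X, y)
ranked = sorted(enumerate(rf.feature_importances_), key=lambda item: item[1], reverse=True)
for index, importance in ranked:
    print(f"feature {index}: importance {importance:.3f}")
```

If one or two features dominate the ranking on the real data, they are the ones worth checking for leakage from the target.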




















I believe the problem you are facing is class imbalance. 99% of your data belongs to one class, and it may be that your test data contains only that class. Because 99% of the data belongs to one class, there is a high probability that your model will predict that class for all of your test data.
To deal with imbalanced data you should use AUROC instead of accuracy. You can also use techniques like oversampling and undersampling to balance the dataset.
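To see why accuracy can mislead here, consider a "model" that always predicts the majority class. The sketch below uses a hypothetical test set of 990 negatives and 10 positives and a plain pairwise ROC AUC written out for illustration; in practice `sklearn.metrics.roc_auc_score` computes the same quantity.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outranks a random negative.

    Ties count as half. This is a plain O(n_pos * n_neg) version for illustration.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical test set with roughly the imbalance described in the question.
y_true = [0] * 990 + [1] * 10

# A "model" that ignores its input and always predicts the majority class.
constant_scores = [0.0] * len(y_true)
constant_labels = [0] * len(y_true)

majority_accuracy = accuracy(y_true, constant_labels)
majority_auc = roc_auc(y_true, constant_scores)
print("accuracy:", majority_accuracy)
print("ROC AUC: ", majority_auc)
```

The constant predictor reaches 99% accuracy while its ROC AUC is 0.5, the same as random guessing, which is the gap the answer is pointing at.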















First answer: answered Jul 19 '18 at 9:48 by c zl











            • $begingroup$
              confusion matrix for random forest [[19937 1] [ 8 52]] confusion matrix for decision tree [[19938 0] [ 0 60]]
              $endgroup$
              – Harigovind Valsakumar
              Jul 20 '18 at 4:03







            • 1




              $begingroup$
              Is this for training data or test data? What's the performance on training data? Please read my first 2 points again; did you ensure you are not training on the test data?
              $endgroup$
              – c zl
              Jul 20 '18 at 5:49










            • $begingroup$
              For test data about 80k rows in train and 20k rows in test
              $endgroup$
              – Harigovind Valsakumar
              Jul 20 '18 at 5:59







            • 1




              $begingroup$
              You are not answering my questions..
              $endgroup$
              – c zl
              Jul 20 '18 at 6:04










            • $begingroup$
              I used training data for training score : 0.9999499949995 100 accuracy for test data
              $endgroup$
              – Harigovind Valsakumar
              Jul 20 '18 at 6:11

















            • $begingroup$
              confusion matrix for random forest [[19937 1] [ 8 52]] confusion matrix for decision tree [[19938 0] [ 0 60]]
              $endgroup$
              – Harigovind Valsakumar
              Jul 20 '18 at 4:03







            • 1




              $begingroup$
              Is this for training data or test data? What's the performance on training data? Please read my first 2 points again; did you ensure you are not training on the test data?
              $endgroup$
              – c zl
              Jul 20 '18 at 5:49










            • $begingroup$
              For test data about 80k rows in train and 20k rows in test
              $endgroup$
              – Harigovind Valsakumar
              Jul 20 '18 at 5:59







            • 1




              $begingroup$
              You are not answering my questions..
              $endgroup$
              – c zl
              Jul 20 '18 at 6:04










            • $begingroup$
              I used training data for training score : 0.9999499949995 100 accuracy for test data
              $endgroup$
              – Harigovind Valsakumar
              Jul 20 '18 at 6:11
















            $begingroup$
            confusion matrix for random forest [[19937 1] [ 8 52]] confusion matrix for decision tree [[19938 0] [ 0 60]]
            $endgroup$
            – Harigovind Valsakumar
            Jul 20 '18 at 4:03





            $begingroup$
            confusion matrix for random forest [[19937 1] [ 8 52]] confusion matrix for decision tree [[19938 0] [ 0 60]]
            $endgroup$
            – Harigovind Valsakumar
            Jul 20 '18 at 4:03





            1




            1




            $begingroup$
            Is this for training data or test data? What's the performance on training data? Please read my first 2 points again; did you ensure you are not training on the test data?
            $endgroup$
            – c zl
            Jul 20 '18 at 5:49




            $begingroup$
            Is this for training data or test data? What's the performance on training data? Please read my first 2 points again; did you ensure you are not training on the test data?
            $endgroup$
            – c zl
            Jul 20 '18 at 5:49












            $begingroup$
            For test data about 80k rows in train and 20k rows in test
            $endgroup$
            – Harigovind Valsakumar
            Jul 20 '18 at 5:59





            $begingroup$
            For test data about 80k rows in train and 20k rows in test
            $endgroup$
            – Harigovind Valsakumar
            Jul 20 '18 at 5:59





























            3












            $begingroup$

            The default hyperparameters of DecisionTreeClassifier allow it to overfit your training data.



            The default min_samples_leaf is 1 and the default max_depth is None. This combination allows your DecisionTreeClassifier to keep growing until every leaf is pure, possibly down to a single data point per leaf.



            Since you are getting 100% accuracy, I would assume you have duplicates across your train and test splits. This has nothing to do with the way you split the data, but with the way you cleaned it.



            Can you check whether you have duplicate data points?



            # Toy data: the third row duplicates the first.
            x = [[1, 2, 3],
                 [4, 5, 6],
                 [1, 2, 3]]

            y = [1,
                 2,
                 1]

            initial_number_of_data_points = len(x)


            def get_unique(X_matrix, y_vector):
                """Drop duplicate (features, label) pairs, keeping first occurrences."""
                # Rows become tuples so that they are hashable; dict.fromkeys
                # removes duplicates while preserving the original order.
                pairs = zip((tuple(row) for row in X_matrix), y_vector)
                Xy = list(dict.fromkeys(pairs))
                X_matrix = [list(features) for features, _ in Xy]
                y_vector = [label for _, label in Xy]
                return X_matrix, y_vector


            x, y = get_unique(x, y)
            data_points_removed = initial_number_of_data_points - len(x)
            print("Number of duplicates removed:", data_points_removed)


            If you have duplicates across your train and test splits, high accuracy is to be expected.
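The snippet above removes duplicates within one dataset. To see whether duplicates actually span the two splits, a minimal sketch like the following may help. The helper name `count_leaked_rows` and the toy rows are hypothetical and only for illustration; it assumes each row is a plain list of hashable feature values.

```python
def count_leaked_rows(X_train, X_test):
    """Count test rows whose feature values also appear in the training set."""
    seen_in_train = {tuple(row) for row in X_train}
    return sum(1 for row in X_test if tuple(row) in seen_in_train)


# Toy data (hypothetical): the first test row repeats a training row.
X_train = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
X_test = [[1, 2, 3], [10, 11, 12]]

leaked = count_leaked_rows(X_train, X_test)
print(f"{leaked} of {len(X_test)} test rows also appear in the training set")
```

If a large share of the test rows shows up here, the test score mostly measures memorization rather than generalization.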

















            $endgroup$

























                edited 2 days ago









                Blenzus










                answered Jul 22 '18 at 13:48









                BrunoGL





















                    1












                    $begingroup$

                    Please check whether you used your test set when building the model. This is a common scenario; see, for example:



                    Random Forest Classifier gives very high accuracy on test set - overfitting?



                    If that is the case, everything makes sense: the random forest tries not to overfit, while a single decision tree simply memorizes your data as a tree.
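To illustrate what "memorizing" means for the score, here is a small sketch with a deliberately naive model. `LookupTableClassifier` is a hypothetical toy class (not a scikit-learn estimator) and the data is made up; it stands in for a fully grown tree that has one leaf per training row.

```python
class LookupTableClassifier:
    """Toy model that memorizes training rows (hypothetical, for illustration)."""

    def __init__(self, default_label):
        self.default_label = default_label
        self.table = {}

    def fit(self, X, y):
        # Store the label of every training row verbatim.
        for row, label in zip(X, y):
            self.table[tuple(row)] = label
        return self

    def predict(self, X):
        # Known rows get their stored label, unknown rows the default.
        return [self.table.get(tuple(row), self.default_label) for row in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


X_train, y_train = [[0, 0], [0, 1], [1, 0]], [0, 1, 1]
X_new, y_new = [[1, 1], [2, 2]], [0, 1]

model = LookupTableClassifier(default_label=0).fit(X_train, y_train)
seen_accuracy = accuracy(y_train, model.predict(X_train))  # rows used for fitting
unseen_accuracy = accuracy(y_new, model.predict(X_new))    # rows never seen before
print(seen_accuracy, unseen_accuracy)
```

Any row that was available during fitting is scored perfectly, so a test set that leaked into training says nothing about how the model behaves on new data.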















                    $endgroup$



























                        answered Jul 19 '18 at 10:38









                        SmallChess





















                            1












                            $begingroup$

                            I agree with c zl. In my experience this doesn't sound like a stable model; it points to a lucky random cut of the data, and the model will likely struggle to deliver similar performance on unseen data.



                            The best models have:



                            1. high accuracy on the train data,

                            2. equally high accuracy on the test data,

                            3. train and test accuracy within 5~10% of each other, which probably indicates model stability. The smaller the difference the better, I feel.

                            Bootstrapping and k-fold cross-validation should usually provide more reliable performance numbers.
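As a rough illustration of the bootstrapping idea, the sketch below resamples per-row outcomes to get a percentile interval around a single accuracy number. The function name, its parameters, and the 95-out-of-100 outcome are assumptions made for this example, not values from the question.

```python
import random


def bootstrap_accuracy_interval(correct, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy.

    `correct` is a list of 0/1 flags, one per test row (1 = prediction was right).
    """
    rng = random.Random(seed)
    n = len(correct)
    # Resample the rows with replacement and recompute accuracy each time.
    estimates = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical outcome: 95 correct predictions out of 100 test rows.
correct = [1] * 95 + [0] * 5
low, high = bootstrap_accuracy_interval(correct)
print(f"accuracy = {sum(correct) / len(correct):.2f}, interval = [{low:.2f}, {high:.2f}]")
```

A wide interval suggests the reported score depends heavily on which rows happened to land in the test set.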

















                            $endgroup$

























                                edited Jul 20 '18 at 13:44









                                Stephen Rauch










                                answered Jul 20 '18 at 13:35









                                Dan9ie





















                                    1












                                    $begingroup$

                                    1. As already mentioned here, DTs overfit easily with the default parameters, so RFs are usually the better choice compared to DTs. Consider them the more generalized model.

                                    2. Why are the accuracies different? RF always uses a random subset of the variables when growing each tree, while DT uses all of them. So you have a number of features (not a large one, I suppose) that have a very big influence on the target variable in the whole dataset. Why so? Identify them and investigate them thoroughly.

                                    3. Now, can we say that DT is more stable for this particular task than RF? I'd say no, because the situation may change. If your DT relies on 1-3 important features, you cannot be sure that these features will play the same significant role in the future.

                                    4. You can use DT, but keep in mind that you have to retrain the model regularly, and compare the results with RF.

                                    So, my advice:



                                    • compute feature importances (you can use the RF model to get them) and use only the important features;

                                    • use some kind of KFold or StratifiedKFold here; it will give you more reliable scores;

                                    • since you have an imbalanced dataset, set class_weight; it should improve the RF score;

                                    • use GridSearchCV; it will also help you avoid overfitting (and 1000 trees is likely more than you need);

                                    • always record the train score as well;

                                    • investigate the important features.
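Two of the suggestions above, class weighting and stratified folds, can be sketched without any library. The weight formula below is the one scikit-learn documents for class_weight="balanced", n_samples / (n_classes * class_count). The fold assignment is a simplified round-robin stand-in for StratifiedKFold (no shuffling), and both function names and the label counts are hypothetical.

```python
from collections import Counter, defaultdict


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def stratified_fold_ids(y, n_splits):
    """Assign a fold id to every sample, dealing each class out round-robin."""
    fold_ids = [None] * len(y)
    next_fold = defaultdict(int)
    for index, label in enumerate(y):
        fold_ids[index] = next_fold[label] % n_splits
        next_fold[label] += 1
    return fold_ids


# Hypothetical imbalanced labels: 12 negatives, 4 positives.
y = [0] * 12 + [1] * 4
weights = balanced_class_weights(y)
folds = stratified_fold_ids(y, n_splits=4)

print(weights)
for fold in range(4):
    labels_in_fold = [label for label, f in zip(y, folds) if f == fold]
    print(fold, Counter(labels_in_fold))
```

Each fold ends up with the same class ratio as the full label list, so no fold is left without minority-class rows, and the rare class receives the larger weight.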














                                    $endgroup$



























                                        answered Jul 26 '18 at 6:15









                                        avchauzov





















                                            1












                                            $begingroup$

                                            I believe the problem you are facing is a class imbalance problem. 99% of your data belongs to one class, and the test data you have may consist of that class only. Because 99% of the data belongs to one class, there is a high probability that your model will predict all of your test data as that class.
                                            To deal with imbalanced data, you should use AUROC instead of accuracy. You can also use techniques like oversampling and undersampling to make the data set balanced.
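To see why accuracy alone says little here, the confusion matrices posted in the comments can be turned into per-class metrics. This is a sketch under one assumption that the thread does not confirm: rows are true classes and columns are predicted classes, with the minority class second (the layout scikit-learn's confusion_matrix uses). The `summarize` helper is hypothetical.

```python
def summarize(matrix):
    """Return accuracy, precision and recall for a 2x2 confusion matrix.

    Assumes rows are true classes and columns are predicted classes,
    with the minority (positive) class in the second row/column.
    """
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        # Accuracy of a model that always predicts the majority class.
        "majority_baseline": (tn + fp) / total,
    }


random_forest = [[19937, 1], [8, 52]]
decision_tree = [[19938, 0], [0, 60]]

for name, matrix in [("random forest", random_forest), ("decision tree", decision_tree)]:
    stats = summarize(matrix)
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

Under that assumption, always predicting the majority class would already reach about 99.7% accuracy on this test set, while the random forest misses 8 of the 60 minority rows, which is the difference that recall exposes and accuracy hides.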















                                            $endgroup$



























                                                answered Sep 11 '18 at 14:52









                                                Vikram


























