Why can decision trees have a high amount of variance





I've heard that decision trees can have a high amount of variance, and that for a data set $D$ split into train/test sets the learned tree could be quite different depending on how the data was split. Apparently, this provides the motivation for algorithms such as Random Forest.



Is this correct? Why does a decision tree suffer from high variability?



Edit: just to note that I don't really follow the current answer and haven't been able to resolve that in the comments.
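
For concreteness, here is a minimal sketch of the behaviour I'm asking about (the synthetic data and scikit-learn are my own assumptions, not part of the claim itself): fit unrestricted trees on different random splits of the same data set and compare their predictions on a few fixed points.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_query = X[:5]  # a few fixed points on which to compare predictions

for seed in range(3):
    X_train, _, y_train, _ = train_test_split(X, y, test_size=0.3, random_state=seed)
    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print(f"split {seed}: depth={tree.get_depth()}, predictions={tree.predict(X_query)}")

# The depths and the predictions on the same query points typically change from
# split to split -- the fitted model itself is unstable, i.e. it has high variance.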










      machine-learning classification decision-trees training variance






asked Mar 28 at 17:57 by baxx, edited Mar 28 at 20:49
          2 Answers







The point is that if your training data does not contain the same input features with different labels (i.e. it has $0$ Bayes error), a decision tree can learn it entirely, and that can lead to overfitting, also known as high variance. This is why people usually prune trees, using cross-validation, to keep them from overfitting the training data.
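
As a concrete illustration (my own sketch with synthetic data; scikit-learn's cost-complexity pruning is just one way to implement the pruning described above), you can pick a pruning strength by cross-validation:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate pruning strengths come from the tree's cost-complexity pruning path.
ccp_alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas
cv_scores = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
    for a in ccp_alphas
]
best_alpha = ccp_alphas[int(np.argmax(cv_scores))]
print("chosen ccp_alpha:", best_alpha)  # larger alpha = more aggressive pruning, lower variance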



Decision trees are powerful classifiers. Algorithms such as bagging combine many such powerful classifiers into an ensemble in order to obtain a classifier that does not have high variance. One approach is to ignore some features and use only the others at each step, as Random Forest does, in order to find the features which generalize well. Another is to train each decision tree on a random sample drawn from the training data, putting each drawn point back before sampling again, i.e. sampling with replacement, or bootstrapping.
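
A minimal sketch of the variance reduction described above (synthetic data; scikit-learn's RandomForestClassifier is assumed as the bagged ensemble): refit a single deep tree and a forest on many bootstrap samples and compare how much their predictions on fixed query points move around.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_query = X[:20]          # fixed points on which prediction variance is measured
rng = np.random.default_rng(0)

def prediction_variance(make_model, n_rounds=20):
    probs = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(X), len(X))        # one bootstrap resample
        probs.append(make_model().fit(X[idx], y[idx]).predict_proba(X_query)[:, 1])
    return np.var(probs, axis=0).mean()              # average variance over the query points

print("single tree  :", prediction_variance(lambda: DecisionTreeClassifier()))
print("random forest:", prediction_variance(lambda: RandomForestClassifier(n_estimators=100)))

# The ensemble's predictions fluctuate far less across training sets than the single
# fully grown tree's, which is exactly the variance reduction bagging is after.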



The reason that decision trees can overfit is their VC dimension. Although it is not infinite (unlike 1-NN's), it is very large, which leads to overfitting; it simply means you have to provide a lot of data in order not to overfit. For more on the VC dimension of decision trees, take a look at Are decision tree algorithms linear or nonlinear.






answered Mar 28 at 18:20 by Vaalizaadeh
• "the same input features with different labels which leads to 0 Bayes error", I'm not sure what you mean by this. – baxx, Mar 28 at 18:29










• @baxx I meant that your training data of different classes do not intersect in the current feature space; namely, in the current space the distribution of each class does not overlap with the others. – Vaalizaadeh, Mar 28 at 19:34










• if there's a way to explain this in more "plain english" then I can follow, currently this answer is a bit abstract though. The data of different classes (are you just referring to variables here?) in the current feature space (is this the data set?) do not have intersection (not sure what you're referring to there, they're mutually exclusive? Why wouldn't they be if they're different variables?) – baxx, Mar 28 at 20:08










• Suppose your input feature space is $\mathbb{R}$, i.e. you have one variable that can take any real value. This number could be temperature, for instance; suppose it can take any value (forget about -273). Now you have an output label which can be cold or hot. Suppose your training set consists of the opinions of different people, and different people may have different opinions. Consequently, in the current feature space, which consists of only the temperature, you may have the same temperature, say 4, labelled both cold and hot. This means that if you plot the histogram of each class, cold and hot, you have … – Vaalizaadeh, Mar 28 at 20:13











• … an intersection. This means even the best possible brain cannot achieve $100\%$ accuracy, let alone an ML algorithm. – Vaalizaadeh, Mar 28 at 20:14
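
A minimal numeric sketch of the scenario described in the last two comments (one temperature feature with overlapping "cold"/"hot" label distributions; the data and library choice are assumptions for illustration, not part of the comments):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
cold = rng.normal(10, 5, 1000)        # temperatures people called "cold"
hot = rng.normal(20, 5, 1000)         # temperatures people called "hot" -- the two overlap
X = np.concatenate([cold, hot]).reshape(-1, 1)
y = np.array([0] * 1000 + [1] * 1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier().fit(X_tr, y_tr)    # fully grown tree
print("train accuracy:", tree.score(X_tr, y_tr))   # ~1.0: the tree memorises the overlap region
print("test accuracy :", tree.score(X_te, y_te))   # capped well below 1.0 by the Bayes error

# Because both labels occur at the same temperatures, no classifier can reach 100%;
# a tree that chases 100% on the training data is the overfitting / high-variance case.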



















          It is relatively simple if you understand what variance refers to in this context. A model has high variance if it is very sensitive to (small) changes in the training data.



          A decision tree has high variance because, if you imagine a very large tree, it can basically adjust its predictions to every single input.



Suppose you wanted to predict the outcome of a soccer game. A decision tree could make decisions like:




          IF



          1. player X is on the field AND

          2. team A has a home game AND

          3. the weather is sunny AND

          4. the number of attending fans >= 26000 AND

          5. it is past 3pm

          THEN team A wins.




          If the tree is very deep, it will get very specific and you may only have one such game in your training data. It probably would not be appropriate to base your predictions on just one example.



Now, if you make a small change, e.g. set the number of attending fans to 25999, a decision tree might give you a completely different answer (because the game no longer meets the 4th condition).



          Linear regression, for example, would not be so sensitive to a small change because it is limited ("biased" -> see bias-variance tradeoff) to linear relationships and cannot represent sudden changes from 25999 to 26000 fans.
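
A small sketch of this contrast (hypothetical attendance data, not from the answer; logistic regression stands in here for the "biased" linear model):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
fans = rng.integers(10_000, 60_000, size=200)                    # hypothetical attendance figures
wins = (fans + rng.normal(0, 8_000, size=200) > 30_000).astype(int)
X = (fans / 1000.0).reshape(-1, 1)                               # feature: fans in thousands

tree = DecisionTreeClassifier().fit(X, wins)                     # unrestricted depth
logit = LogisticRegression().fit(X, wins)

grid = np.linspace(25.0, 27.0, 201).reshape(-1, 1)               # 25,000 .. 27,000 fans
tree_p = tree.predict_proba(grid)[:, 1]
logit_p = logit.predict_proba(grid)[:, 1]
print("tree  : prediction changes %d times across the grid" % int(np.sum(np.diff(tree_p) != 0)))
print("logit : varies smoothly from %.3f to %.3f" % (logit_p.min(), logit_p.max()))

# The fully grown tree is piecewise constant and can flip wherever a split threshold
# (e.g. near 26,000 fans) falls; the logistic model only drifts slightly over the same range.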



That's why it is important not to make decision trees arbitrarily large/deep. This limits their variance.



          (See e.g. here for more on how random forests can help with this further.)






answered Mar 28 at 21:56 by oW_












