Why can decision trees have a high amount of variance





I've heard that decision trees can have a high amount of variance, and that for a data set $D$ split into train/test sets the learned tree could be quite different depending on how the data was split. Apparently, this provides the motivation for algorithms such as Random Forest.



Is this correct? Why does a decision tree suffer from high variability?



Edit: just to note that I don't really follow the current answer and haven't been able to resolve that in the comments.
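
For concreteness, here is a minimal sketch of the behaviour I'm asking about (the synthetic data and scikit-learn are my own assumptions, not part of the claim itself): fit unrestricted trees on different random splits of the same data set and compare their predictions on a few fixed points.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_query = X[:5]  # a few fixed points on which to compare predictions

for seed in range(3):
    X_train, _, y_train, _ = train_test_split(X, y, test_size=0.3, random_state=seed)
    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print(f"split {seed}: depth={tree.get_depth()}, predictions={tree.predict(X_query)}")

# The depths and the predictions on the same query points typically change from
# split to split -- the fitted model itself is unstable, i.e. it has high variance.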










      machine-learning classification decision-trees training variance






asked Mar 28 at 17:57 by baxx, edited Mar 28 at 20:49
          2 Answers







The point is that if your training data does not contain the same input features with different labels (i.e. it has $0$ Bayes error), a decision tree can learn it entirely, and that can lead to overfitting, also known as high variance. This is why people usually prune trees, using cross-validation, to keep them from overfitting the training data.
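
As a concrete illustration (my own sketch with synthetic data; scikit-learn's cost-complexity pruning is just one way to implement the pruning described above), you can pick a pruning strength by cross-validation:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate pruning strengths come from the tree's cost-complexity pruning path.
ccp_alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas
cv_scores = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
    for a in ccp_alphas
]
best_alpha = ccp_alphas[int(np.argmax(cv_scores))]
print("chosen ccp_alpha:", best_alpha)  # larger alpha = more aggressive pruning, lower variance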



Decision trees are powerful classifiers. Algorithms such as bagging combine many such powerful classifiers into an ensemble in order to obtain a classifier that does not have high variance. One approach is to ignore some features and use only the others at each step, as Random Forest does, in order to find the features which generalize well. Another is to train each decision tree on a random sample drawn from the training data, putting each drawn point back before sampling again, i.e. sampling with replacement, or bootstrapping.
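
A minimal sketch of the variance reduction described above (synthetic data; scikit-learn's RandomForestClassifier is assumed as the bagged ensemble): refit a single deep tree and a forest on many bootstrap samples and compare how much their predictions on fixed query points move around.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_query = X[:20]          # fixed points on which prediction variance is measured
rng = np.random.default_rng(0)

def prediction_variance(make_model, n_rounds=20):
    probs = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(X), len(X))        # one bootstrap resample
        probs.append(make_model().fit(X[idx], y[idx]).predict_proba(X_query)[:, 1])
    return np.var(probs, axis=0).mean()              # average variance over the query points

print("single tree  :", prediction_variance(lambda: DecisionTreeClassifier()))
print("random forest:", prediction_variance(lambda: RandomForestClassifier(n_estimators=100)))

# The ensemble's predictions fluctuate far less across training sets than the single
# fully grown tree's, which is exactly the variance reduction bagging is after.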



The reason that decision trees can overfit is their VC dimension. Although it is not infinite (unlike 1-NN's), it is very large, which leads to overfitting; it simply means you have to provide a lot of data in order not to overfit. For more on the VC dimension of decision trees, take a look at Are decision tree algorithms linear or nonlinear.






answered Mar 28 at 18:20 by Vaalizaadeh
• "the same input features with different labels which leads to 0 Bayes error", I'm not sure what you mean by this. – baxx, Mar 28 at 18:29










• @baxx I meant that your training data of different classes do not intersect in the current feature space; namely, in the current space the distribution of each class does not overlap with the others. – Vaalizaadeh, Mar 28 at 19:34










• if there's a way to explain this in more "plain english" then I can follow, currently this answer is a bit abstract though. The data of different classes (are you just referring to variables here?) in the current feature space (is this the data set?) do not have intersection (not sure what you're referring to there, they're mutually exclusive? Why wouldn't they be if they're different variables?) – baxx, Mar 28 at 20:08










• Suppose your input feature space is $\mathbb{R}$, i.e. you have one variable that can take any real value. This number could be temperature, for instance; suppose it can take any value (forget about -273). Now you have an output label which can be cold or hot. Suppose your training set consists of the opinions of different people, and different people may have different opinions. Consequently, in the current feature space, which consists of only the temperature, you may have the same temperature, say 4, labelled both cold and hot. This means that if you plot the histogram of each class, cold and hot, you have … – Vaalizaadeh, Mar 28 at 20:13











• … an intersection. This means even the best possible brain cannot achieve $100\%$ accuracy, let alone an ML algorithm. – Vaalizaadeh, Mar 28 at 20:14
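
A minimal numeric sketch of the scenario described in the last two comments (one temperature feature with overlapping "cold"/"hot" label distributions; the data and library choice are assumptions for illustration, not part of the comments):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
cold = rng.normal(10, 5, 1000)        # temperatures people called "cold"
hot = rng.normal(20, 5, 1000)         # temperatures people called "hot" -- the two overlap
X = np.concatenate([cold, hot]).reshape(-1, 1)
y = np.array([0] * 1000 + [1] * 1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier().fit(X_tr, y_tr)    # fully grown tree
print("train accuracy:", tree.score(X_tr, y_tr))   # ~1.0: the tree memorises the overlap region
print("test accuracy :", tree.score(X_te, y_te))   # capped well below 1.0 by the Bayes error

# Because both labels occur at the same temperatures, no classifier can reach 100%;
# a tree that chases 100% on the training data is the overfitting / high-variance case.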



















          It is relatively simple if you understand what variance refers to in this context. A model has high variance if it is very sensitive to (small) changes in the training data.



          A decision tree has high variance because, if you imagine a very large tree, it can basically adjust its predictions to every single input.



Suppose you wanted to predict the outcome of a soccer game. A decision tree could make decisions like:




          IF



          1. player X is on the field AND

          2. team A has a home game AND

          3. the weather is sunny AND

          4. the number of attending fans >= 26000 AND

          5. it is past 3pm

          THEN team A wins.




          If the tree is very deep, it will get very specific and you may only have one such game in your training data. It probably would not be appropriate to base your predictions on just one example.



Now, if you make a small change, e.g. set the number of attending fans to 25999, a decision tree might give you a completely different answer (because the game no longer meets the 4th condition).



          Linear regression, for example, would not be so sensitive to a small change because it is limited ("biased" -> see bias-variance tradeoff) to linear relationships and cannot represent sudden changes from 25999 to 26000 fans.
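
A small sketch of this contrast (hypothetical attendance data, not from the answer; logistic regression stands in here for the "biased" linear model):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
fans = rng.integers(10_000, 60_000, size=200)                    # hypothetical attendance figures
wins = (fans + rng.normal(0, 8_000, size=200) > 30_000).astype(int)
X = (fans / 1000.0).reshape(-1, 1)                               # feature: fans in thousands

tree = DecisionTreeClassifier().fit(X, wins)                     # unrestricted depth
logit = LogisticRegression().fit(X, wins)

grid = np.linspace(25.0, 27.0, 201).reshape(-1, 1)               # 25,000 .. 27,000 fans
tree_p = tree.predict_proba(grid)[:, 1]
logit_p = logit.predict_proba(grid)[:, 1]
print("tree  : prediction changes %d times across the grid" % int(np.sum(np.diff(tree_p) != 0)))
print("logit : varies smoothly from %.3f to %.3f" % (logit_p.min(), logit_p.max()))

# The fully grown tree is piecewise constant and can flip wherever a split threshold
# (e.g. near 26,000 fans) falls; the logistic model only drifts slightly over the same range.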



That's why it is important not to make decision trees arbitrarily large/deep. This limits their variance.



          (See e.g. here for more on how random forests can help with this further.)






answered Mar 28 at 21:56 by oW_












