Wrong calculation of feature importance of decision tree in R
$begingroup$
I trained a decision tree in both Python and R, but I think the way feature importance is calculated in R may be wrong. The following sample code reproduces the problem. Suppose I am predicting the income of a population of 1000 based on gender and country.
n = 1000  # population size
x = data.frame(gender=sample(c("M","F"),n,T), country=sample(c("A","I"),n,T))
x$income = ifelse(x$gender=="M", rnorm(n, 100, 10), rnorm(n, 80, 10))
x$income = x$income + ifelse(x$country=="A", rnorm(n, 100, 10), rnorm(n, 80, 10))
write.csv(x, "data.csv")
Then let's fit a decision tree in R with a max depth of 1.
library(rpart)
fit = rpart(income~., data = x, control=rpart.control(maxdepth=1))
caret::varImp(fit)
fit
I get the following feature importances:
country 0.2507630
gender 0.2424981
for a tree that splits only on country:
1) root 1000 407373.4 180.5759
  2) country=I 481 147999.6 170.0772 *
  3) country=A 519 157219.6 190.3060 *
When I try again with a max depth of 2, I get the feature importances:
country 0.2507630
gender 0.8874599
for a tree that splits first on country, as before, and then on gender:
1) root 1000 407373.40 180.5759
  2) country=I 481 147999.60 170.0772
    4) gender=F 232 40082.49 159.2805 *
    5) gender=M 249 55676.09 180.1367 *
  3) country=A 519 157219.60 190.3060
    6) gender=F 248 57546.77 180.4749 *
    7) gender=M 271 53767.73 199.3028 *
However, if I run similar code in Python:
from sklearn.tree import DecisionTreeRegressor, export_graphviz
from IPython.display import Image
import pandas as pd
import pydot

data = pd.read_csv("data.csv")
dtree = DecisionTreeRegressor(max_depth=1)
# Encode the two binary categorical features as booleans
X = data[["gender", "country"]].copy()
X["gender"] = X["gender"] == "M"
X["country"] = X["country"] == "A"
y = data["income"]
dtree.fit(X, y)
# Feature importances, in the same order as X.columns
print(dict(zip(X.columns, dtree.feature_importances_)))
# Export as dot file
export_graphviz(dtree, out_file="tree.dot",
                feature_names=list(X.columns), filled=True, rounded=True,
                special_characters=True)
(graph,) = pydot.graph_from_dot_file("tree.dot")
graph.write_png("tree.png")
# Display in Jupyter notebook
Image(filename="tree.png")
For a max depth of 1, I get the feature importances:
gender 0
country 1
and for a max depth of 2:
gender 0.49
country 0.51
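The scikit-learn figures can be checked by hand from the node deviances in the rpart output above. This is a minimal sketch, assuming that both libraries find the same splits on this data and that scikit-learn's importance is the total reduction in squared error credited to each feature, normalised to sum to 1. The helper name `normalized_importance` and the node-numbering trick (children of node k are 2k and 2k+1, as in the rpart printout) are illustrative choices, not part of either library.

```python
# Deviance (sum of squared errors) of each node, copied from the rpart printout.
sse = {1: 407373.40, 2: 147999.60, 3: 157219.60,
       4: 40082.49, 5: 55676.09, 6: 57546.77, 7: 53767.73}
# Feature used to split each internal node; node k has children 2k and 2k+1.
splits = {1: "country", 2: "gender", 3: "gender"}


def normalized_importance(max_depth):
    """Share of the total SSE reduction credited to each feature."""
    decrease = {"country": 0.0, "gender": 0.0}
    for node, feature in splits.items():
        depth = node.bit_length() - 1  # the root (node 1) has depth 0
        if depth < max_depth:
            decrease[feature] += sse[node] - sse[2 * node] - sse[2 * node + 1]
    total = sum(decrease.values())
    return {feature: value / total for feature, value in decrease.items()}


for depth in (1, 2):
    print(depth, normalized_importance(depth))
```

Under these assumptions, the depth-1 tree credits everything to country, and the depth-2 tree splits the credit almost evenly, in line with the values reported above.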
Now I have the following two questions:
1) In R, when I select max depth = 1 and the only split is on the country column, why does it still give a feature importance value for gender, even though gender is not part of the final model? In Python, for example, gender gets an importance of 0.
2) Why, in R, does the feature importance of the gender column become greater than that of the country column? I would expect country to be more important, since the initial split happened on country and not on gender, as in the values we got from Python.
One of my colleagues pointed out that in R, the feature importance of each column is calculated at each split. For example, an importance value for both gender and country is calculated at the first split, and again at the second split. Since we have already split on country, there is no further information gain from country at the second split, but there is from gender. In the end, all these importances are summed, so features used for splits at a later stage get a larger value.
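This explanation can be tested against the numbers printed above. The sketch below assumes that the R value for a feature is the sum, over the nodes where it splits, of the relative reduction in deviance at that node (1 minus children's deviance over the node's own deviance), plus the improvement of gender's unused candidate split at the root. That last value is taken to be the 0.2424981 reported at max depth 1. These are inferences from the printed output, not a statement of how caret or rpart is implemented, and the helper name `summed_relative_improvement` is hypothetical.

```python
# Deviance (sum of squared errors) of each node, copied from the rpart printout.
sse = {1: 407373.40, 2: 147999.60, 3: 157219.60,
       4: 40082.49, 5: 55676.09, 6: 57546.77, 7: 53767.73}
# Feature used to split each internal node; node k has children 2k and 2k+1.
splits = {1: "country", 2: "gender", 3: "gender"}
# Assumption: gender's reported importance at max depth 1 is the improvement
# of its best (unused) candidate split at the root.
root_competitor = {"gender": 0.2424981}


def summed_relative_improvement(max_depth):
    """Sum per-node relative deviance reductions for each feature."""
    totals = {"country": 0.0, "gender": 0.0}
    for node, feature in splits.items():
        depth = node.bit_length() - 1  # the root (node 1) has depth 0
        if depth < max_depth:
            children = sse[2 * node] + sse[2 * node + 1]
            totals[feature] += 1 - children / sse[node]
    for feature, improvement in root_competitor.items():
        totals[feature] += improvement
    return totals


for depth in (1, 2):
    print(depth, summed_relative_improvement(depth))
```

If the assumptions hold, this reproduces the R values to within rounding of the printed deviances. Because each reduction is measured relative to its own node rather than to the root, the two gender splits at depth 2 add up to more than the single country split.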
python r decision-trees
$endgroup$
asked Apr 5 at 17:32
MNA