Wrong calculation of feature importance of decision tree in R

I trained a decision tree in both Python and R, and I think the way feature importance is calculated in R may be wrong. The sample code below reproduces the problem. Suppose I am predicting the income of a population of 1000 from gender and country.

n = 1000  # population size stated above
x = data.frame(gender=sample(c("M","F"),n,TRUE), country=sample(c("A","I"),n,TRUE))
x$income = ifelse(x$gender=="M", rnorm(n, 100, 10), rnorm(n, 80, 10))
x$income = x$income + ifelse(x$country=="A", rnorm(n, 100, 10), rnorm(n, 80, 10))
write.csv(x, "data.csv")

Then let's fit a decision tree in R with a max depth of 1.

library(rpart)
fit = rpart(income~., data = x, control=rpart.control(maxdepth=1))
caret::varImp(fit)
fit

I get the following feature importances:

country 0.2507630
gender 0.2424981

for a tree that splits only on country:

1) root 1000 407373.4 180.5759 
 2) country=I 481 147999.6 170.0772 * 
 3) country=A 519 157219.6 190.3060 *

When I try again with a max depth of 2, the feature importances become

country 0.2507630
gender 0.8874599

for a tree that splits first on country, as before, and then on gender:

1) root 1000 407373.40 180.5759
  2) country=I 481 147999.60 170.0772
    4) gender=F 232 40082.49 159.2805 *
    5) gender=M 249 55676.09 180.1367 *
  3) country=A 519 157219.60 190.3060
    6) gender=F 248 57546.77 180.4749 *
    7) gender=M 271 53767.73 199.3028 *

However, if I run similar code in Python:

import pandas as pd
import pydot
from IPython.display import Image
from sklearn.tree import DecisionTreeRegressor, export_graphviz

data = pd.read_csv("data.csv")

dtree = DecisionTreeRegressor(max_depth=1)
# Copy the columns so the assignments below do not modify a view of `data`
X = data[["gender", "country"]].copy()
X["gender"] = X["gender"] == 'M'
X["country"] = X["country"] == 'A'
y = data['income']
dtree.fit(X, y)
# Feature importances, in the column order of X
print(dict(zip(X.columns, dtree.feature_importances_)))

# Export as dot file
export_graphviz(dtree, out_file='tree.dot',
                feature_names=list(X.columns), filled=True, rounded=True,
                special_characters=True)

(graph,) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('tree.png')
# Display in jupyter notebook
Image(filename='tree.png')

For a max depth of 1, I get feature importances of

gender 0
country 1

and for a max depth of 2

gender 0.49
country 0.51
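
The Python values can be reproduced by hand from the node deviances in the depth-2 R printout. This is a minimal sketch, assuming scikit-learn's `feature_importances_` is the normalized total reduction in squared error credited to each feature, and that the Python tree makes the same splits as the R tree. The variable names are illustrative only.

```python
# Node deviances (sums of squared errors) copied from the depth-2 R printout.
root_sse = 407373.40
country_children_sse = (147999.60, 157219.60)   # nodes 2 and 3
gender_children_sse = (
    (40082.49, 55676.09),   # nodes 4 and 5, under country=I
    (57546.77, 53767.73),   # nodes 6 and 7, under country=A
)

# Absolute reduction in squared error credited to each feature.
country_gain = root_sse - sum(country_children_sse)
gender_gain = sum(
    parent - sum(children)
    for parent, children in zip(country_children_sse, gender_children_sse)
)

# Assumption: importances are each feature's share of the total reduction.
total_gain = country_gain + gender_gain
importance = {
    "country": country_gain / total_gain,
    "gender": gender_gain / total_gain,
}
print(importance)
```

Under those assumptions the shares round to 0.51 for country and 0.49 for gender, in line with the Python output. At depth 1 only country is ever split on, so it takes the whole share.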

Now I have the following two questions:

1) In R, with max depth = 1 the only split is on country, so why does gender still get a feature importance value even though it is not part of the final model? Python, for example, gives gender an importance of 0.

2) Why, in R, does the importance of gender become greater than that of country at max depth 2? Country should be the more important feature, since the first split happened on country and not on gender, which is what the Python values show.

One of my colleagues pointed out that in R the feature importance of each column is calculated at each split. For example, importance values for both gender and country are calculated at the first split, and again at the second split. Since we have already split on country, there is no further information gain from country at the second split, but there is from gender. In the end all of these importances are summed, so features used at a later stage of splitting end up with a larger value.
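
The colleague's explanation can be checked against the numbers already shown. The sketch below assumes (this is not confirmed anywhere in the post) that rpart scores a regression split by the fraction of the node's deviance that the split removes, and that `caret::varImp` sums these scores over every node, including splits that were only runner-up candidates. The helper `relative_improvement` is a hypothetical name used for illustration.

```python
# Node deviances copied from the R printouts above.
root_sse = 407373.40
node2_sse, node3_sse = 147999.60, 157219.60   # children of the country split
node4_sse, node5_sse = 40082.49, 55676.09     # gender split under country=I
node6_sse, node7_sse = 57546.77, 53767.73     # gender split under country=A


def relative_improvement(parent_sse, left_sse, right_sse):
    """Fraction of the parent's deviance removed by a split (assumed scoring rule)."""
    return (parent_sse - left_sse - right_sse) / parent_sse


country_root = relative_improvement(root_sse, node2_sse, node3_sse)
gender_node2 = relative_improvement(node2_sse, node4_sse, node5_sse)
gender_node3 = relative_improvement(node3_sse, node6_sse, node7_sse)

# Importance reported for gender in the depth-1 tree, where gender is never
# used. Assumption: this is gender's score as a competing split at the root.
gender_root_competitor = 0.2424981

gender_total = gender_root_competitor + gender_node2 + gender_node3
print("country:", country_root)
print("gender: ", gender_total)
```

Under those assumptions the sums come out at about 0.2508 for country and 0.8875 for gender, which agrees with the values R reported. That would be consistent with gender scoring at depth 1 as a competing split at the root, and with gender overtaking country at depth 2 because per-node fractions are added without weighting by node size.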

asked Apr 5 at 17:32

MNA

python r decision-trees
