


Wrong calculation of feature importance of decision tree in R















$begingroup$


I trained a decision tree in both Python and R, but I think the way feature importance is calculated in R may be wrong. The sample code below reproduces the problem. Say I am predicting the income of a population of 1000 people based on gender and country.



n = 1000  # population size
x = data.frame(gender=sample(c("M","F"), n, TRUE), country=sample(c("A","I"), n, TRUE))
# income = gender effect + country effect, each with Gaussian noise
x$income = ifelse(x$gender=="M", rnorm(n, 100, 10), rnorm(n, 80, 10))
x$income = x$income + ifelse(x$country=="A", rnorm(n, 100, 10), rnorm(n, 80, 10))
write.csv(x, "data.csv")


Then let's fit a decision tree in R with a max depth of 1.



library(rpart)
fit = rpart(income~., data = x, control=rpart.control(maxdepth=1))
caret::varImp(fit)  # feature importance
fit                 # print the fitted tree


I get the following feature importance:



country 0.2507630
gender  0.2424981


for a tree that splits only on country:



1) root 1000 407373.4 180.5759
  2) country=I 481 147999.6 170.0772 *
  3) country=A 519 157219.6 190.3060 *


When I try again with a max depth of 2, I get the feature importance:



country 0.2507630
gender  0.8874599


for a tree that splits first on country, as before, and then on gender:



1) root 1000 407373.40 180.5759
  2) country=I 481 147999.60 170.0772
    4) gender=F 232 40082.49 159.2805 *
    5) gender=M 249 55676.09 180.1367 *
  3) country=A 519 157219.60 190.3060
    6) gender=F 248 57546.77 180.4749 *
    7) gender=M 271 53767.73 199.3028 *


However, if I run similar code in Python:



import pandas as pd
import pydot
from IPython.display import Image
from sklearn.tree import DecisionTreeRegressor, export_graphviz

data = pd.read_csv("data.csv")

# Encode the two binary features as booleans
X = data[["gender", "country"]].copy()
X["gender"] = X["gender"] == 'M'
X["country"] = X["country"] == 'A'
y = data["income"]

dtree = DecisionTreeRegressor(max_depth=1)
dtree.fit(X, y)
# Feature importance per column
print(dict(zip(X.columns, dtree.feature_importances_)))

# Export as dot file
export_graphviz(dtree, out_file='tree.dot',
                feature_names=list(X.columns), filled=True, rounded=True,
                special_characters=True)

(graph,) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('tree.png')
# Display in jupyter notebook
Image(filename='tree.png')


For a max depth of 1 I get the feature importance:



gender  0
country 1


and for a max depth of 2:



gender  0.49
country 0.51
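The Python values can be checked by hand against the deviances in the depth-2 rpart printout. This is a minimal sketch, assuming that scikit-learn made the same splits as rpart on this data and that its importance is each feature's total reduction in squared error, normalized to sum to 1. The variable names are illustrative only.

```python
# Deviances (sums of squared errors) copied from the depth-2 rpart printout.
root_sse = 407373.40
# (parent deviance, [child deviances]) for each gender split at depth 2.
gender_splits = [
    (147999.60, [40082.49, 55676.09]),  # node 2 (country=I) -> nodes 4, 5
    (157219.60, [57546.77, 53767.73]),  # node 3 (country=A) -> nodes 6, 7
]

# Absolute reduction in squared error credited to each feature.
country_gain = root_sse - sum(parent for parent, _ in gender_splits)
gender_gain = sum(parent - sum(children) for parent, children in gender_splits)

# Normalize so that the importances sum to 1.
total_gain = country_gain + gender_gain
importance = {
    "country": country_gain / total_gain,
    "gender": gender_gain / total_gain,
}
print({name: round(value, 2) for name, value in importance.items()})
# prints {'country': 0.51, 'gender': 0.49}
```

With a max depth of 1 only the country split exists, so the same normalization gives country 1 and gender 0.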


Now I have the following two questions:

1) In R, when I set max depth = 1 and the only split is on the country column, why does it still report a feature importance value for gender, even though gender is not part of the final model? In Python, for example, gender gets an importance of 0.

2) Why, in R, does the feature importance of gender become greater than that of country at max depth 2? Country should be the more important feature, since the initial split is on country and not on gender, which is what the Python values show.

One of my colleagues pointed out that in R the feature importance of each column is calculated at each split. For example, an importance value for both gender and country is calculated at the first split, and again at the second split. But since we have already split on country, a further split on country yields no information gain, while a split on gender does. In the end all these importances are summed, so features used for splits at a later stage end up with larger values.
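The colleague's explanation can be checked against the numbers printed by R. This is a minimal sketch, assuming that the R importance is the sum, over every split evaluated for a feature, of that split's improvement relative to the deviance of the node being split, and that the 0.2424981 reported for gender at max depth 1 is the value of the gender split that was evaluated at the root but not chosen. Neither assumption is confirmed here; the helper name `relative_improvement` is hypothetical.

```python
# Node deviances copied from the depth-2 rpart printout in the question.
root = 407373.40
node2, node3 = 147999.60, 157219.60  # country=I, country=A
leaves2 = 40082.49 + 55676.09        # gender split under node 2
leaves3 = 57546.77 + 53767.73        # gender split under node 3


def relative_improvement(parent, children):
    """Fraction of the parent's deviance removed by a split (hypothetical helper)."""
    return (parent - children) / parent


# Country is only ever split at the root.
country = relative_improvement(root, node2 + node3)

# Assumption: gender also gets credit for the root split it lost to country.
gender_at_root = 0.2424981  # gender importance reported at max depth 1
gender = (gender_at_root
          + relative_improvement(node2, leaves2)
          + relative_improvement(node3, leaves3))

print(round(country, 4), round(gender, 4))
# prints 0.2508 0.8875
```

Both results agree with the reported 0.2507630 and 0.8874599 up to the rounding of the printed deviances. Under these assumptions, gender scores higher because each later split is measured against the smaller deviance of its own node and the per-split values are then added together, whereas the Python values weight every split by its absolute reduction in error.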



















$endgroup$























      python r decision-trees
















      asked Apr 5 at 17:32









      MNA



















