How to extract the sample split (values) of decision tree leaves (terminal nodes) applying h2o library


























$begingroup$


Sorry for the long story, but it is a long story. :)
I am using the h2o library for Python to build a decision tree and to extract decision rules from it.
My training data has labels that take TRUE and FALSE values.
My final goal is to extract the significant paths (leaves) of the tree, where the number of TRUE cases significantly exceeds the number of FALSE ones.



    import h2o
    from h2o.estimators.gbm import H2OGradientBoostingEstimator
    from h2o.tree import H2OTree
    from h2o.utils.typechecks import assert_is_type, Enum

    treemodel = H2OGradientBoostingEstimator(ntrees=3, max_depth=maxDepth, distribution="bernoulli")
    treemodel.train(x=somedata.names[1:], y=somelabel.names[0], training_frame=somedata)
    dtree = H2OTree(model=treemodel, tree_number=0, tree_class=False)

    # Function found in a forum (it mirrors the h2o model method of the same name)
    def predict_leaf_node_assignment(self, test_data, type="Path"):
        if not isinstance(test_data, h2o.H2OFrame):
            raise ValueError("test_data must be an instance of H2OFrame")
        assert_is_type(type, None, Enum("Path", "Node_ID"))
        j = h2o.api("POST /3/Predictions/models/%s/frames/%s" % (self.model_id, test_data.frame_id),
                    data={"leaf_node_assignment": True, "leaf_node_assignment_type": type})
        return h2o.get_frame(j["predictions_frame"]["name"])

    # Called with the H2OTree object as `self`; this is the call that returns no output
    dfResLabH2O.leafs = predict_leaf_node_assignment(dtree, test_data=dfResLabH2O, type="Path")


In sklearn, the leaves can be explored through tree_.value.
As far as I understand, h2o has no such option.
Instead, h2o can return the predictions on the leaves.



When I run dtree.predictions, I get pretty weird results:



    dtree.predictions
    Out[32]: [0.0, -0.020934915, 0.0832189, -0.0151052615, -0.13453846, -0.0039859135, 0.2931017, 0.0836743, -0.008562919, -0.12405087, -0.02181114, 0.06444048, -0.01736593, 0.13912177, 0.10727943]


My questions (somebody has already asked this, but no clear answer has been provided so far):



  1. What is the meaning of the negative predictions? I expected proportions p of TRUE to ALL or FALSE to ALL, where 0<=p<=1. Is anything wrong with my model?
    I ran the same data in scikit-learn, and there I can point out the significant paths and extract rules.


  2. For the positive values: is it the TRUE-to-ALL or the FALSE-to-ALL proportion? I am guessing FALSE, since I passed tree_class=False, but I am not sure.


  3. Is there any method for h2o trees that reveals the sample size of a given leaf and the counts [n1,n2] of TRUE and FALSE cases respectively, similar to what sklearn provides?


  4. I found in some forums a function, predict_leaf_node_assignment, that predicts on a dataset and returns the leaf node assignment (only for tree-based models), but it returns no output and I cannot find any example of how to use it.


  5. The bottom line: I'd like to extract the sample size of each leaf and the specific path to it, as [n1,n2] or as valid proportions.


I'll appreciate any kind of help and suggestions.
Thank you.





















$endgroup$


















  • $begingroup$
    Can you use code formatting to make it easier to read the code portions? Thanks!
    $endgroup$
    – Wes
    Feb 12 at 19:09










  • $begingroup$
    Some more of your code might be helpful. In particular, does H2O know that your target is categorical? It might be trying to do regression instead, leading to those negative predictions. (Edit: or it might be reporting the log-odds rather than the probability.)
    $endgroup$
    – Ben Reiniger
    Feb 13 at 2:45










  • $begingroup$
    Please see the first line of the code: I fitted the GBE model with a Bernoulli distribution for a binomial target. As for log-odds, I converted the values to proportions assuming they are log-odds, and they all came out between 0.46 and 0.53, i.e. all 15 leaves of the decision tree would have almost equal numbers of TRUE and FALSE cases? That does not make sense and contradicts the sklearn findings for the same data. And generally, why would a non-parametric decision tree model return log-odds? It's not a logistic regression... The most important attribute of a tree is the proportions of the subsamples that end up in its leaves.
    $endgroup$
    – Sapiens
    Feb 13 at 17:43























python data decision-trees prediction h2o














asked Feb 12 at 19:08 by Sapiens, edited Feb 19 at 19:05
















1 Answer












$begingroup$

So far I'm not seeing a way to extract training information from the model. The H2OTree.predictions can/should give you proportion information, but won't give you leaf sample sizes. For that, you should be able to use predict_leaf_node_assignment, passing your training set in (to wastefully get passed through the model, *shrug*).



predict_leaf_node_assignment should return a dataframe with the leaf assignment for each of your training points. (The R version appears to support returning either the path or the node id, but the python one doesn't seem to have it.) You could take this, join to the original frame, and use group and aggregation functions to produce the desired [n1,n2].*



Regarding the output of predictions, see https://stackoverflow.com/questions/44735518/how-to-reproduce-the-h2o-gbm-class-probability-calculation . In particular, the default learning rate in H2O's GBM is 0.1, which helps explain your muted results.
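
To make the point about muted values concrete, here is a minimal sketch of how small leaf values can turn into a probability. It assumes, as the linked question describes, that each tree contributes one additive value on the log-odds scale (already shrunk by the learning rate) on top of an initial log-odds. The names `sigmoid`, `gbm_probability` and `init_f`, and the numbers used, are illustrative and not part of the H2O API.

```python
import math


def sigmoid(z):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gbm_probability(init_f, leaf_values):
    """Combine an initial log-odds with one leaf value per tree.

    Assumption: leaf values are additive on the log-odds scale and
    already include the learning-rate shrinkage.
    """
    return sigmoid(init_f + sum(leaf_values))


# Hypothetical example: one row falling into one leaf in each of 3 trees.
init_f = 0.0  # assumed initial log-odds (balanced classes)
leaf_values_for_row = [-0.13453846, 0.0832189, 0.06444048]
p = gbm_probability(init_f, leaf_values_for_row)
print(round(p, 4))
```

Under this reading, a single leaf value is not a class proportion: with only a few trees and a small learning rate, every row ends up close to the base rate, and a negative value just means that leaf pushes the log-odds down.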



Finally, for a little more fun with the model's tree objects, see https://www.pavel.cool/machine%20learning/h2o-3/h2o-3-tree-api/
and
https://novyden.blogspot.com/2018/12/finally-you-can-plot-h2o-decision-trees.html



*EDIT: For doing the grouping and aggregation:

(I'm more used to pandas than H2O frames, so I'll convert first. And given that H2O thinks your FALSE class is the main class, maybe those are strings not boolean?)



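    # Leaf assignment per training row, converted from an H2OFrame to pandas
    predicted_leaves_frame = treemodel.predict_leaf_node_assignment(data).as_data_frame()
    df = data.as_data_frame()
    # 1 for TRUE, 0 otherwise (assumes the labels come through as strings)
    df['binary_dep_var'] = df['dep_var'].apply(lambda x: 1 if x == 'TRUE' else 0)
    # 'T1.C1' holds the leaf path in the first tree
    df['T1'] = predicted_leaves_frame['T1.C1']
    print(df.groupby('T1')['binary_dep_var'].agg(['sum', 'count', 'mean']))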


This should give for each leaf the number of TRUE samples and the total number of samples and the ratio. If you really need the number of FALSE samples, you could define your own aggregation function or just post-process this new dataframe.
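
The same group-and-count idea can be sketched without H2O or pandas, which also shows how to get the FALSE counts directly. This is only an illustration under stated assumptions: the leaf paths are the ones from the comment below, the labels are made up, and `leaf_class_counts` is a hypothetical helper, not an H2O function.

```python
from collections import defaultdict


def leaf_class_counts(leaf_paths, labels):
    """Return {leaf_path: [n_true, n_false]} for paired paths and labels."""
    counts = defaultdict(lambda: [0, 0])
    for path, label in zip(leaf_paths, labels):
        # index 0 counts TRUE rows, index 1 counts FALSE rows
        counts[path][0 if label else 1] += 1
    return dict(counts)


# Leaf paths as returned for the first tree; labels are hypothetical.
leaf_paths = ["RRLL", "RRRL", "RRRL", "RRRL", "RRRL", "RRRL", "RLRR"]
labels = [True, False, True, True, False, True, False]

counts = leaf_class_counts(leaf_paths, labels)
for path, (n_true, n_false) in sorted(counts.items()):
    total = n_true + n_false
    print(path, [n_true, n_false], n_true / total)
```

In practice the two input lists would come from the leaf-assignment column and the label column of the training frame, in the same row order.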

















$endgroup$












  • $begingroup$
    Try it as treemodel.predict_leaf_node_assignment(data).
    $endgroup$
    – Ben Reiniger
    Feb 19 at 19:26










  • $begingroup$
    OK, thanks. What I get now is a column of leaf_assignment: RRLL RRRL RRRL RRRL RRRL RRRL RLRR. What do I do in Python to get at least the proportions, if not n1 and n2, for the terminal leaves? The code I saw in your link is in R. Thanks.
    $endgroup$
    – Sapiens
    Feb 19 at 20:05











  • $begingroup$
    See the new edit; I've added essentially what I had in mind with "join to the original frame, and use group and aggregation functions". (It could probably be done natively with the H2O frames.)
    $endgroup$
    – Ben Reiniger
    Feb 19 at 22:07










Your Answer





StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
);
);
, "mathjax-editing");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "557"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f45461%2fhow-to-extract-the-sample-split-values-of-decision-tree-leaves-terminal-node%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









0












$begingroup$

So far I'm not seeing a way to extract training information from the model. The H2OTree.predictions can/should give you proportion information, but won't give you leaf sample sizes. For that, you should be able to use predict_leaf_node_assignment, passing your training set in (to wastefully get passed through the model, *shrug*).



predict_leaf_node_assignment should return a dataframe with the leaf assignment for each of your training points. (The R version appears to support returning either the path or the node id, but the python one doesn't seem to have it.) You could take this, join to the original frame, and use group and aggregation functions to produce the desired [n1,n2].*



Regarding the output of predictions, see https://stackoverflow.com/questions/44735518/how-to-reproduce-the-h2o-gbm-class-probability-calculation . In particular, the default learning rate in H2O's GBM is 0.1, which helps explain your muted results.



Finally, for a little more fun with the the model's tree objects, see https://www.pavel.cool/machine%20learning/h2o-3/h2o-3-tree-api/
and
https://novyden.blogspot.com/2018/12/finally-you-can-plot-h2o-decision-trees.html



*EDIT: For doing the grouping and aggregation:

(I'm more used to pandas than H2O frames, so I'll convert first. And given that H2O thinks your FALSE class is the main class, maybe those are strings not boolean?)



predicted_leaves_frame = treemodel.predict_leaf_node_assignment(data).as_data_frame()
df = data.as_data_frame()
df['binary_dep_var'] = df['dep_var'].apply(lambda x: 1 if x=='TRUE' else 0)
df['T1'] = predicted_leaves_frame['T1.C1']
print(df.groupby('T1')['binary_dep_var'].agg(['sum','count','mean'])


This should give for each leaf the number of TRUE samples and the total number of samples and the ratio. If you really need the number of FALSE samples, you could define your own aggregation function or just post-process this new dataframe.






share|improve this answer











$endgroup$












  • $begingroup$
    Try it as treemodel.predict_leaf_node_assignment(data).
    $endgroup$
    – Ben Reiniger
    Feb 19 at 19:26










  • $begingroup$
    OK, thanks. What I got now is a column of leaf_assignment RRLL RRRL RRRL RRRL RRRL RRRL RLRR What do I do in Python to get at least proportions if not n1 n2, for the terrminal leaves, the code I saw in your link relates to R. Thanks
    $endgroup$
    – Sapiens
    Feb 19 at 20:05











  • $begingroup$
    See the new edit; I've added essentially what I had in mind with "join to the original frame, and use group and aggregation functions". (It could probably be done natively with the H2O frames.)
    $endgroup$
    – Ben Reiniger
    Feb 19 at 22:07















0












$begingroup$

So far I'm not seeing a way to extract training information from the model. The H2OTree.predictions can/should give you proportion information, but won't give you leaf sample sizes. For that, you should be able to use predict_leaf_node_assignment, passing your training set in (to wastefully get passed through the model, *shrug*).



predict_leaf_node_assignment should return a dataframe with the leaf assignment for each of your training points. (The R version appears to support returning either the path or the node id, but the python one doesn't seem to have it.) You could take this, join to the original frame, and use group and aggregation functions to produce the desired [n1,n2].*



Regarding the output of predictions, see https://stackoverflow.com/questions/44735518/how-to-reproduce-the-h2o-gbm-class-probability-calculation . In particular, the default learning rate in H2O's GBM is 0.1, which helps explain your muted results.



Finally, for a little more fun with the the model's tree objects, see https://www.pavel.cool/machine%20learning/h2o-3/h2o-3-tree-api/
and
https://novyden.blogspot.com/2018/12/finally-you-can-plot-h2o-decision-trees.html



*EDIT: For doing the grouping and aggregation:

(I'm more used to pandas than H2O frames, so I'll convert first. And given that H2O thinks your FALSE class is the main class, maybe those are strings not boolean?)



predicted_leaves_frame = treemodel.predict_leaf_node_assignment(data).as_data_frame()
df = data.as_data_frame()
df['binary_dep_var'] = df['dep_var'].apply(lambda x: 1 if x=='TRUE' else 0)
df['T1'] = predicted_leaves_frame['T1.C1']
print(df.groupby('T1')['binary_dep_var'].agg(['sum','count','mean'])


This should give, for each leaf, the number of TRUE samples, the total number of samples, and their ratio. If you really need the number of FALSE samples, you could define your own aggregation function or just post-process this new dataframe.
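As a rough sketch of the post-processing route, the following uses a small made-up pandas frame in place of the real joined data (the leaf paths and dep_var values are invented, and the n_true / n_false / n_total column names are just illustrative). The FALSE count per leaf is the total minus the TRUE count.

```python
import pandas as pd

# Hypothetical stand-in for the joined frame: one row per training sample,
# with its leaf path ('T1') and the string-valued response ('dep_var').
df = pd.DataFrame({
    "T1": ["RRLL", "RRRL", "RRRL", "RRRL", "RLRR", "RLRR"],
    "dep_var": ["TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE"],
})
df["binary_dep_var"] = (df["dep_var"] == "TRUE").astype(int)

# Same aggregation as above, with more descriptive column names.
leaf_stats = (
    df.groupby("T1")["binary_dep_var"]
    .agg(["sum", "count", "mean"])
    .rename(columns={"sum": "n_true", "count": "n_total", "mean": "true_ratio"})
)

# Post-process: FALSE samples are whatever is left after the TRUE ones.
leaf_stats["n_false"] = leaf_stats["n_total"] - leaf_stats["n_true"]
print(leaf_stats[["n_false", "n_true", "n_total", "true_ratio"]])
```

The n_false and n_true columns together play the role of [n1,n2] for each terminal node.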

















$endgroup$












  • $begingroup$
    Try it as treemodel.predict_leaf_node_assignment(data).
    $endgroup$
    – Ben Reiniger
    Feb 19 at 19:26










  • $begingroup$
    OK, thanks. What I get now is a column of leaf assignments: RRLL RRRL RRRL RRRL RRRL RRRL RLRR. What do I do in Python to get at least the proportions, if not n1 and n2, for the terminal leaves? The code I saw in your link is for R. Thanks
    $endgroup$
    – Sapiens
    Feb 19 at 20:05











  • $begingroup$
    See the new edit; I've added essentially what I had in mind with "join to the original frame, and use group and aggregation functions". (It could probably be done natively with the H2O frames.)
    $endgroup$
    – Ben Reiniger
    Feb 19 at 22:07













edited Feb 19 at 21:58

answered Feb 14 at 16:45

Ben Reiniger










