Grouping probabilities of full model vs fitting model on subset of variables




Suppose I want to predict some event probability with a set of features. Some of the features may be gender or some other categorical variable. Assume the probabilities are well calibrated.



Now, say, I need to report the average predicted probability for males and females (and maybe some other subset of features).



Is there a difference between

  1. averaging the predicted probabilities of the full model within each group, or

  2. simply fitting a model on only the subset of variables?

And if so, why? The full model would outperform the simpler model by a large margin. An example would also be appreciated.
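For concreteness, here is a minimal sketch of the two approaches on synthetic data (the column names, the logistic-regression model, and the data-generating process are all illustrative assumptions):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "sex": rng.choice(["f", "m"], size=n),
    "age": rng.integers(18, 65, size=n),
})
# Synthetic target whose probability depends on both features.
logits = (df["sex"] == "f") * 0.8 + (df["age"] - 40) * 0.05
df["event"] = rng.random(n) < 1 / (1 + np.exp(-logits))

X_full = pd.get_dummies(df[["sex", "age"]], drop_first=True)  # all features
X_sub = pd.get_dummies(df[["sex"]], drop_first=True)          # grouping feature only

# Approach 1: fit the full model, then average its predicted probabilities per group.
df["p_full"] = LogisticRegression(max_iter=1000).fit(X_full, df["event"]).predict_proba(X_full)[:, 1]

# Approach 2: fit a model on the grouping variable(s) only.
df["p_sub"] = LogisticRegression(max_iter=1000).fit(X_sub, df["event"]).predict_proba(X_sub)[:, 1]

print(df.groupby("sex")[["p_full", "p_sub"]].mean())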










machine-learning predictive-modeling statistics

asked Mar 22 at 17:16 by oW_

  • Could you please clarify (1) and (2) with an example like $P(event \mid \text{male})$ vs ....? – Esmailian, Mar 22 at 18:04

  • Sorry, not sure what you are looking for?! – oW_, Mar 22 at 18:11

  • For example $P(event \mid x_1)$ or $\mathrm{AVG}_{x_n:(x_{n,1})} P(event \mid x_{n,1})$, vs averaging over all possible feature values $\mathrm{AVG}_{x_2} P(e \mid x_1, x_2)$ or all data points $\mathrm{AVG}_{x_n:(x_{n,1}, x_{n,2})} P(e \mid x_{n,1}, x_{n,2})$? – Esmailian, Mar 22 at 18:25

  • Both are basically a group by male/female + avg(prob), but in 1. I fit the model on other features that are not in the "group by", whereas 2. considers just the features that I'll be grouping on. – oW_, Mar 22 at 18:27

1 Answer

A very thought-provoking question.

Surprisingly, the second approach (subset) is better in theory. The first one (full) is an unbiased estimator of the second one (subset). That is, the expectation of the average of probabilities under the full model is equal to the probability under the subset model.

I said "in theory" because we assume that the full and subset models estimate the true P(event | full) and P(event | subset) exactly (zero error, perfect generalization, etc.). In practice, the choice of learning algorithm and training set affects how accurately they can fulfill this assumption. The model could be Naive Bayes, logistic regression, a neural network with a softmax layer, etc.



Proof



Let




  1. $F_f$ denote the full features,

  2. $F_s$ denote the subset features (that we group on),

  3. $F_d=F_f-F_s$ denote the features that we average out,

  4. $\boldsymbol{A}=\{x: F_s(x)=a\}$ denote the set of all possible instances whose subset features equal $a$, and

  5. $A \subset \boldsymbol{A}$ denote the observed subset of $\boldsymbol{A}$ in the training set.

Here is the average of probabilities in the full model (first approach):
$$\begin{align*}
P_f(e|F_s=a)&=\frac{1}{N_A}\sum_{x\in A}P(e|F_d(x),a)
\end{align*}$$

The expectation of the above average is:
$$\begin{align*}
E[P_f(e|F_s=a)]&=\frac{1}{N_A}\sum_{x\in A}E[P(e|F_d(x),a)]\\
&=\frac{1}{N_A} N_A\, E[P(e|F_f(X)) \mid X \in \boldsymbol{A}]\\
&=\sum_{b}P(e|F_d=b,F_s=a)P(F_d=b|F_s=a)\\
&=\sum_{b}P(e,F_d=b|F_s=a)\\
&=P(e|F_s=a)
\end{align*}$$



where $P(e|F_s=a)$ is the output of the second approach.
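For a quick numerical sanity check of this identity, here is a small simulation sketch with made-up numbers, assuming both models return the true conditional probabilities:

import numpy as np

rng = np.random.default_rng(0)
p_d_given_a = np.array([0.7, 0.3])    # P(F_d = b | F_s = a) for b in {0, 1}
p_e_given_da = np.array([0.1, 0.6])   # P(e | F_d = b, F_s = a)
p_e_given_a = (p_d_given_a * p_e_given_da).sum()  # the subset-model probability

group_averages = []
for _ in range(20000):
    # Draw an observed group A of size 50 from P(F_d | F_s = a) ...
    d = rng.choice([0, 1], size=50, p=p_d_given_a)
    # ... and average the full-model probabilities over it (first approach).
    group_averages.append(p_e_given_da[d].mean())

print(p_e_given_a, np.mean(group_averages))  # both are approximately 0.25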



Example



To verify, here is an example:



$e=\text{has\_job}$

$F_f=\{\text{sex}, \text{has\_degree}\}$

$F_s=\{\text{sex}\}$

$F_d=\{\text{has\_degree}\}$



Full model sees:



x  sex  has_degree  has_job
1  f    1           0
2  f    1           1
3  f    0           0

4  m    0           0
5  m    0           0
6  m    1           1


And builds



sex  has_degree  P(has_job=1 | sex, has_degree)
f    0           0
f    1           0.5
m    0           0
m    1           1.0


And answers $P(\text{has\_job}=1|\text{sex}=f)$ with

$$\frac{1}{3}\sum_{x \in \{1, 2, 3\}}P(\text{has\_job}=1|x)=(0.5+0.5+0)/3=0.33,$$

and answers $P(\text{has\_job}=1|\text{sex}=m)$ with

$$\frac{1}{3}\sum_{x \in \{4, 5, 6\}}P(\text{has\_job}=1|x)=(0+0+1.0)/3=0.33$$



On the other hand, the subset model sees:



x  sex  has_job
1  f    0
2  f    1
3  f    0

4  m    0
5  m    0
6  m    1


And builds



sex  P(has_job=1 | sex)
f    0.33
m    0.33


And answers $P(\text{has\_job}=1|\text{sex}=f)$ with 0.33, and $P(\text{has\_job}=1|\text{sex}=m)$ with 0.33.
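As a sketch, the same numbers can be reproduced with a few lines of pandas, treating each "model" as the empirical conditional frequency (which is exactly what the tables above encode):

import pandas as pd

df = pd.DataFrame({
    "sex":        ["f", "f", "f", "m", "m", "m"],
    "has_degree": [1, 1, 0, 0, 0, 1],
    "has_job":    [0, 1, 0, 0, 0, 1],
})

# Full model: P(has_job=1 | sex, has_degree) per row, then averaged within each sex group.
p_full = df.groupby(["sex", "has_degree"])["has_job"].transform("mean")
print(p_full.groupby(df["sex"]).mean())     # f: 0.33, m: 0.33

# Subset model: P(has_job=1 | sex) directly.
print(df.groupby("sex")["has_job"].mean())  # f: 0.33, m: 0.33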






answered Mar 22 at 17:42 by Esmailian, edited Mar 23 at 0:08












  • Thank you for the answer! I understand that the two are the same when you look at counts of the events. They will be the same no matter in what order you perform the aggregation. I'm talking about probabilities that stem from a machine learning model (say some neural net etc.). – oW_, Mar 22 at 23:11

  • @oW_ My pleasure. A neural network is trying to estimate P(e|~), so the final result would be the same. The shortcoming of a NN in modelling P(e|~) is its error, and I think there is not much to say about the error, either mathematically or by example. – Esmailian, Mar 22 at 23:19

  • You seem to argue that the correct probabilities are based on counts, whereas a model would simply approximate them. Then there would be no need for machine learning at all. An ML model provides better generalization and should provide more robust probabilities (e.g. in the case where you have many sparse categories, because it can borrow strength from related categories). My question is whether or not on average they are also more accurate, or if that higher accuracy of the full model over the subset model gets lost when aggregating. Hope this helps. – oW_, Mar 22 at 23:35

  • @oW_ I think the generalization power of an ML model is relevant to "how accurately does the model estimate P(e|~)?", meaning: better generalization, better estimation, less error. We can compare the two approaches by assuming perfect generalization (i.e. no error); otherwise we cannot arrive at a conclusion or give an example. – Esmailian, Mar 22 at 23:46









