Grouping probabilities of full model vs fitting model on subset of variables
Suppose I want to predict some event probability with a set of features. Some of the features may be gender or some other categorical variable. Assume the probabilities are well calibrated.
Now, say, I need to report the average predicted probability for males and females (and maybe some other subset of features).
Is there a difference between
- averaging the probabilities of the full model or
- simply fitting a model on only the subset of variables?
And if so, why? The full model would outperform the simpler model by a large margin. An example would also be appreciated.
machine-learning predictive-modeling statistics
Could you please clarify (1) and (2) with an example like $P(\text{event}\mid\text{male})$ vs ...?
– Esmailian
Mar 22 at 18:04
Sorry, not sure what you are looking for?!
– oW_
Mar 22 at 18:11
For example $P(\text{event}\mid x_1)$ or $\mathrm{AVG}_{x_n:(x_{n,1})}\,P(\text{event}\mid x_{n,1})$, vs averaging over all possible feature values $\mathrm{AVG}_{x_2}\,P(e\mid x_1, x_2)$ or over all data points $\mathrm{AVG}_{x_n:(x_{n,1}, x_{n,2})}\,P(e\mid x_{n,1}, x_{n,2})$?
– Esmailian
Mar 22 at 18:25
Both are basically a group-by on male/female + avg(prob), but in 1. I fit the model on other features that are not in the "group by", whereas 2. considers just the features that I'll be grouping on.
– oW_
Mar 22 at 18:27
asked Mar 22 at 17:16
oW_
1 Answer
A very thought-provoking question.
Surprisingly, the second approach (subset) is better in theory. The first one (full) is an unbiased estimator of the second: the expectation of the average of probabilities under the full model equals the probability under the subset model.
I said "in theory" because we assume that the full and subset models estimate the true P(event | full) and P(event | subset) exactly (zero error, perfect generalization, etc.). In practice, the choice of learning algorithm and training set determines how well they fulfill this assumption. The model could be Naive Bayes, logistic regression, a neural network with a softmax layer, etc.
Proof
Let
$F_f$ denote the full feature set,
$F_s$ denote the subset features (that we group on),
$F_d=F_f-F_s$ denote the features that we average out,
$\boldsymbol{A}=\{x: F_s(x)=a\}$ denote the set of all possible instances whose subset features equal $a$, and
$A \subseteq \boldsymbol{A}$ denote the observed subset of $\boldsymbol{A}$ in the training set.
Here is the average of probabilities under the full model (first approach):
$$\begin{align*}
P_f(e\mid F_s=a)&=\frac{1}{N_A}\sum_{x\in A}P(e\mid F_d(x),a)
\end{align*}$$
The expectation of this average is:
$$\begin{align*}
E[P_f(e\mid F_s=a)]&=\frac{1}{N_A}\sum_{x\in A}E[P(e\mid F_d(x),a)]\\
&=\frac{1}{N_A}\,N_A\,E[P(e\mid F_f(X))\mid X \in \boldsymbol{A}]\\
&=\sum_{b}P(e\mid F_d=b,F_s=a)\,P(F_d=b\mid F_s=a)\\
&=\sum_{b}P(e,F_d=b\mid F_s=a)\\
&=P(e\mid F_s=a)
\end{align*}$$
where $P(e\mid F_s=a)$ is the output of the second approach.
Example
To verify, here is an example:
$e=\text{has\_job}$
$F_f=\{\text{sex}, \text{has\_degree}\}$
$F_s=\{\text{sex}\}$
$F_d=\{\text{has\_degree}\}$
Full model sees:
x sex has_degree has_job
1 f 1 0
2 f 1 1
3 f 0 0
4 m 0 0
5 m 0 0
6 m 1 1
And builds
sex has_degree P(has_job=1|sex, has_degree)
f 0 0
f 1 0.5
m 0 0
m 1 1.0
And answers $P(\text{has\_job}=1\mid\text{sex}=f)$ with
$$\frac{1}{3}\sum_{x \in \{1, 2, 3\}}P(\text{has\_job}=1\mid x)=(0.5+0.5+0)/3=0.33,$$
and answers $P(\text{has\_job}=1\mid\text{sex}=m)$ with
$$\frac{1}{3}\sum_{x \in \{4, 5, 6\}}P(\text{has\_job}=1\mid x)=(0+0+1.0)/3=0.33$$
On the other hand, subset model sees:
x sex has_job
1 f 0
2 f 1
3 f 0
4 m 0
5 m 0
6 m 1
And builds
sex P(has_job=1|sex)
f 0.33
m 0.33
And answers $P(\text{has\_job}=1\mid\text{sex}=f)$ with 0.33, and $P(\text{has\_job}=1\mid\text{sex}=m)$ with 0.33.
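The worked example can be checked numerically. A minimal sketch (pure Python, added for illustration; empirical frequencies stand in for the perfectly calibrated models assumed above):

```python
from collections import Counter

# The six training rows from the example: (sex, has_degree, has_job)
data = [("f", 1, 0), ("f", 1, 1), ("f", 0, 0),
        ("m", 0, 0), ("m", 0, 0), ("m", 1, 1)]

# Full model: empirical P(has_job=1 | sex, has_degree)
pos, tot = Counter(), Counter()
for sex, deg, job in data:
    pos[(sex, deg)] += job
    tot[(sex, deg)] += 1
p_full = {k: pos[k] / tot[k] for k in tot}

# First approach: average the full-model probabilities within each sex
avg = {s: sum(p_full[(sex, deg)] for sex, deg, _ in data if sex == s) / 3
       for s in ("f", "m")}

# Second approach: subset model, empirical P(has_job=1 | sex)
p_sub = {s: sum(job for sex, _, job in data if sex == s) / 3
         for s in ("f", "m")}

print(avg, p_sub)  # the two approaches agree: 1/3 for both sexes
```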
Thank you for the answer! I understand that the two are the same when you look at counts of the events. They will be the same no matter in what order you perform the aggregation. I'm talking about probabilities that stem from a machine learning model (say some neural net etc.).
– oW_
Mar 22 at 23:11
@oW_ My pleasure. A neural network is trying to estimate P(e|~), so the final result would be the same. The shortcoming of an NN in modeling P(e|~) is its error, and I think there is not much to say about the error, either mathematically or by example.
– Esmailian
Mar 22 at 23:19
You seem to argue that the correct probabilities are based on counts, whereas a model would simply approximate them. Then there would be no need for machine learning at all. An ML model provides better generalization and should provide more robust probabilities (e.g. when you have many sparse categories, because it can borrow strength from related categories). My question is whether on average they are also more accurate, or if the higher accuracy of the full model over the subset model gets lost when aggregating. Hope this helps.
– oW_
Mar 22 at 23:35
@oW_ I think the generalization power of an ML model comes down to "how accurately does the model estimate P(e|~)?", meaning: better generalization, better estimation, less error. We can compare the two approaches only by assuming perfect generalization (i.e. no error); otherwise we cannot arrive at a conclusion or give an example.
– Esmailian
Mar 22 at 23:46
edited Mar 23 at 0:08
answered Mar 22 at 17:42
Esmailian
$begingroup$
Thank you for the answer! I understand that the two are the same when you look at counts of the events. They will be the same no matter in what order you perform the aggregation. I'm talking about probabilities that stem from a machine learning model (say some neural net etc.).
$endgroup$
– oW_
Mar 22 at 23:11
$begingroup$
@oW_ My pleasure, A neural network is trying to estimate P(e|~), so the final result would be the same. The shortcoming of NN to model P(e|~) is error. I think, there is not much to say about the error either mathematically or by example.
$endgroup$
– Esmailian
Mar 22 at 23:19
$begingroup$
You seem to argue that the correct probabilities are based on counts, whereas a model would simply approximate them. Then there would be no need for machine learning at all. A ML model provides better generalization and should provide more robust probabilities (e.g. in the case where you have many sparse categories because it can borrow strength from related categories). My question is whether or not on average they are also more accurate, or if that higher accuracy of the full model over the subset model gets lost when aggregating. Hope this helps.
$endgroup$
– oW_
Mar 22 at 23:35
$begingroup$
@oW_ I think generalization power of an ML model is relevant to "how accurate model estimates P(e|~)?", meaning: better generalization, better estimation, less error. We can compare the two approach by assuming the perfect generalization (i.e. no error), otherwise we cannot arrive at a conclusion, or give an example.
$endgroup$
– Esmailian
Mar 22 at 23:46
add a comment |
$begingroup$
Thank you for the answer! I understand that the two are the same when you look at counts of the events. They will be the same no matter in what order you perform the aggregation. I'm talking about probabilities that stem from a machine learning model (say some neural net etc.).
$endgroup$
– oW_
Mar 22 at 23:11
$begingroup$
@oW_ My pleasure, A neural network is trying to estimate P(e|~), so the final result would be the same. The shortcoming of NN to model P(e|~) is error. I think, there is not much to say about the error either mathematically or by example.
$endgroup$
– Esmailian
Mar 22 at 23:19
$begingroup$
You seem to argue that the correct probabilities are based on counts, whereas a model would simply approximate them. Then there would be no need for machine learning at all. A ML model provides better generalization and should provide more robust probabilities (e.g. in the case where you have many sparse categories because it can borrow strength from related categories). My question is whether or not on average they are also more accurate, or if that higher accuracy of the full model over the subset model gets lost when aggregating. Hope this helps.
$endgroup$
– oW_
Mar 22 at 23:35
$begingroup$
@oW_ I think generalization power of an ML model is relevant to "how accurate model estimates P(e|~)?", meaning: better generalization, better estimation, less error. We can compare the two approach by assuming the perfect generalization (i.e. no error), otherwise we cannot arrive at a conclusion, or give an example.
$endgroup$
– Esmailian
Mar 22 at 23:46
$begingroup$
Thank you for the answer! I understand that the two are the same when you look at counts of the events. They will be the same no matter in what order you perform the aggregation. I'm talking about probabilities that stem from a machine learning model (say some neural net etc.).
$endgroup$
– oW_
Mar 22 at 23:11
$begingroup$
Thank you for the answer! I understand that the two are the same when you look at counts of the events. They will be the same no matter in what order you perform the aggregation. I'm talking about probabilities that stem from a machine learning model (say some neural net etc.).
$endgroup$
– oW_
Mar 22 at 23:11
$begingroup$
@oW_ My pleasure, A neural network is trying to estimate P(e|~), so the final result would be the same. The shortcoming of NN to model P(e|~) is error. I think, there is not much to say about the error either mathematically or by example.
$endgroup$
– Esmailian
Mar 22 at 23:19
$begingroup$
@oW_ My pleasure, A neural network is trying to estimate P(e|~), so the final result would be the same. The shortcoming of NN to model P(e|~) is error. I think, there is not much to say about the error either mathematically or by example.
$endgroup$
– Esmailian
Mar 22 at 23:19
$begingroup$
You seem to argue that the correct probabilities are based on counts, whereas a model would simply approximate them. Then there would be no need for machine learning at all. A ML model provides better generalization and should provide more robust probabilities (e.g. in the case where you have many sparse categories because it can borrow strength from related categories). My question is whether or not on average they are also more accurate, or if that higher accuracy of the full model over the subset model gets lost when aggregating. Hope this helps.
$endgroup$
– oW_
Mar 22 at 23:35
$begingroup$
@oW_ I think the generalization power of an ML model comes down to "how accurately does the model estimate P(e|~)?", meaning: better generalization, better estimation, less error. We can compare the two approaches by assuming perfect generalization (i.e. no error); otherwise we cannot arrive at a conclusion or give an example.
$endgroup$
– Esmailian
Mar 22 at 23:46
$begingroup$
Could you please clarify (1) and (2) with an example like $P(event|\text{male})$ vs ....?
$endgroup$
– Esmailian
Mar 22 at 18:04
$begingroup$
Sorry, I'm not sure what you are looking for?!
$endgroup$
– oW_
Mar 22 at 18:11
$begingroup$
For example $P(event|x_1)$ or $\text{AVG}_{x_n:(x_{n,1})}P(event|x_{n,1})$, vs averaging over all possible feature values $\text{AVG}_{x_2} P(e|x_1, x_2)$ or over all data points $\text{AVG}_{x_n:(x_{n,1}, x_{n,2})}P(e|x_{n,1}, x_{n,2})$?
$endgroup$
– Esmailian
Mar 22 at 18:25
$begingroup$
Both are basically a group-by on male/female plus avg(prob), but in (1) I fit the model on other features that are not in the group-by, whereas (2) considers just the features that I'll be grouping on.
$endgroup$
– oW_
Mar 22 at 18:27
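$begingroup$
The two approaches being compared can be sketched concretely. Below is a hypothetical illustration on simulated data, using scikit-learn's `LogisticRegression`; the variable names (`male`, `x2`, `event`) and the simulation setup are assumptions for illustration only, not part of the question. Approach (1) fits on all features and then group-by-averages the per-row probabilities; approach (2) fits only on the grouping feature.

```python
# Hypothetical sketch of the two approaches discussed in the comments.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "male": rng.integers(0, 2, n),
    "x2": rng.normal(size=n),
})
# Simulated binary outcome depending on both features.
logit = 0.8 * df["male"] + 0.5 * df["x2"]
df["event"] = rng.random(n) < 1 / (1 + np.exp(-logit))

# Approach 1: fit the full model, predict per row, then group-by-average.
full = LogisticRegression().fit(df[["male", "x2"]], df["event"])
df["p_full"] = full.predict_proba(df[["male", "x2"]])[:, 1]
grouped_full = df.groupby("male")["p_full"].mean()

# Approach 2: fit only on the grouping feature.
sub = LogisticRegression().fit(df[["male"]], df["event"])
df["p_sub"] = sub.predict_proba(df[["male"]])[:, 1]
grouped_sub = df.groupby("male")["p_sub"].mean()

print(grouped_full)
print(grouped_sub)
```

On data like this, both grouped averages land close to the empirical event rate per group; the question in this thread is whether that agreement holds in general once model error enters.
$endgroup$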