Grouping probabilities of full model vs fitting model on subset of variables




Suppose I want to predict some event probability with a set of features. Some of the features may be gender or some other categorical variable. Assume the probabilities are well calibrated.



Now, say, I need to report the average predicted probability for males and females (and maybe some other subset of features).



Is there a difference between

  1. averaging the predicted probabilities of the full model within each group, or

  2. simply fitting a model on only the subset of variables?

And if so, why? The full model would outperform the simpler model by a large margin. An example would also be appreciated.
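For concreteness, here is a minimal sketch of the two approaches on synthetic data (the column names, the logistic-regression model, and the data-generating process are all illustrative assumptions):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "sex": rng.choice(["f", "m"], size=n),
    "age": rng.integers(18, 65, size=n),
})
# Synthetic target whose probability depends on both features.
logits = (df["sex"] == "f") * 0.8 + (df["age"] - 40) * 0.05
df["event"] = rng.random(n) < 1 / (1 + np.exp(-logits))

X_full = pd.get_dummies(df[["sex", "age"]], drop_first=True)  # all features
X_sub = pd.get_dummies(df[["sex"]], drop_first=True)          # grouping feature only

# Approach 1: fit the full model, then average its predicted probabilities per group.
df["p_full"] = LogisticRegression(max_iter=1000).fit(X_full, df["event"]).predict_proba(X_full)[:, 1]

# Approach 2: fit a model on the grouping variable(s) only.
df["p_sub"] = LogisticRegression(max_iter=1000).fit(X_sub, df["event"]).predict_proba(X_sub)[:, 1]

print(df.groupby("sex")[["p_full", "p_sub"]].mean())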










machine-learning predictive-modeling statistics

asked Mar 22 at 17:16 by oW_

  • Could you please clarify (1) and (2) with an example like $P(event \mid \text{male})$ vs ....? – Esmailian, Mar 22 at 18:04

  • Sorry, not sure what you are looking for?! – oW_, Mar 22 at 18:11

  • For example $P(event \mid x_1)$ or $\mathrm{AVG}_{x_n:(x_{n,1})} P(event \mid x_{n,1})$, vs averaging over all possible feature values $\mathrm{AVG}_{x_2} P(e \mid x_1, x_2)$ or all data points $\mathrm{AVG}_{x_n:(x_{n,1}, x_{n,2})} P(e \mid x_{n,1}, x_{n,2})$? – Esmailian, Mar 22 at 18:25

  • Both are basically a group by male/female + avg(prob), but in 1. I fit the model on other features that are not in the "group by", whereas 2. considers just the features that I'll be grouping on. – oW_, Mar 22 at 18:27

1 Answer

A very thought-provoking question.

Surprisingly, the second approach (subset) is better in theory. The first one (full) is an unbiased estimator of the second one (subset). That is, the expectation of the average of probabilities under the full model is equal to the probability under the subset model.

I said "in theory" because we assume that the full and subset models estimate the true P(event | full) and P(event | subset) exactly (zero error, perfect generalization, etc.). In practice, the choice of learning algorithm and training set affects how accurately they can fulfill this assumption. The model could be Naive Bayes, logistic regression, a neural network with a softmax layer, etc.



Proof



Let




  1. $F_f$ denote the full features,

  2. $F_s$ denote the subset features (that we group on),

  3. $F_d=F_f-F_s$ denote the features that we average out,

  4. $\boldsymbol{A}=\{x: F_s(x)=a\}$ denote the set of all possible instances whose subset features equal $a$, and

  5. $A \subset \boldsymbol{A}$ denote the observed subset of $\boldsymbol{A}$ in the training set.

Here is the average of probabilities in the full model (first approach):
$$\begin{align*}
P_f(e|F_s=a)&=\frac{1}{N_A}\sum_{x\in A}P(e|F_d(x),a)
\end{align*}$$

The expectation of the above average is:
$$\begin{align*}
E[P_f(e|F_s=a)]&=\frac{1}{N_A}\sum_{x\in A}E[P(e|F_d(x),a)]\\
&=\frac{1}{N_A} N_A\, E[P(e|F_f(X)) \mid X \in \boldsymbol{A}]\\
&=\sum_{b}P(e|F_d=b,F_s=a)P(F_d=b|F_s=a)\\
&=\sum_{b}P(e,F_d=b|F_s=a)\\
&=P(e|F_s=a)
\end{align*}$$



where $P(e|F_s=a)$ is the output of the second approach.
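For a quick numerical sanity check of this identity, here is a small simulation sketch with made-up numbers, assuming both models return the true conditional probabilities:

import numpy as np

rng = np.random.default_rng(0)
p_d_given_a = np.array([0.7, 0.3])    # P(F_d = b | F_s = a) for b in {0, 1}
p_e_given_da = np.array([0.1, 0.6])   # P(e | F_d = b, F_s = a)
p_e_given_a = (p_d_given_a * p_e_given_da).sum()  # the subset-model probability

group_averages = []
for _ in range(20000):
    # Draw an observed group A of size 50 from P(F_d | F_s = a) ...
    d = rng.choice([0, 1], size=50, p=p_d_given_a)
    # ... and average the full-model probabilities over it (first approach).
    group_averages.append(p_e_given_da[d].mean())

print(p_e_given_a, np.mean(group_averages))  # both are approximately 0.25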



Example



To verify, here is an example:



$e=\text{has\_job}$

$F_f=\{\text{sex}, \text{has\_degree}\}$

$F_s=\{\text{sex}\}$

$F_d=\{\text{has\_degree}\}$



Full model sees:



x  sex  has_degree  has_job
1  f    1           0
2  f    1           1
3  f    0           0

4  m    0           0
5  m    0           0
6  m    1           1


And builds



sex  has_degree  P(has_job=1 | sex, has_degree)
f    0           0
f    1           0.5
m    0           0
m    1           1.0


And answers $P(\text{has\_job}=1|\text{sex}=f)$ with

$$\frac{1}{3}\sum_{x \in \{1, 2, 3\}}P(\text{has\_job}=1|x)=(0.5+0.5+0)/3=0.33,$$

and answers $P(\text{has\_job}=1|\text{sex}=m)$ with

$$\frac{1}{3}\sum_{x \in \{4, 5, 6\}}P(\text{has\_job}=1|x)=(0+0+1.0)/3=0.33$$



On the other hand, the subset model sees:



x  sex  has_job
1  f    0
2  f    1
3  f    0

4  m    0
5  m    0
6  m    1


And builds



sex  P(has_job=1 | sex)
f    0.33
m    0.33


And answers $P(\text{has\_job}=1|\text{sex}=f)$ with 0.33, and $P(\text{has\_job}=1|\text{sex}=m)$ with 0.33.
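As a sketch, the same numbers can be reproduced with a few lines of pandas, treating each "model" as the empirical conditional frequency (which is exactly what the tables above encode):

import pandas as pd

df = pd.DataFrame({
    "sex":        ["f", "f", "f", "m", "m", "m"],
    "has_degree": [1, 1, 0, 0, 0, 1],
    "has_job":    [0, 1, 0, 0, 0, 1],
})

# Full model: P(has_job=1 | sex, has_degree) per row, then averaged within each sex group.
p_full = df.groupby(["sex", "has_degree"])["has_job"].transform("mean")
print(p_full.groupby(df["sex"]).mean())     # f: 0.33, m: 0.33

# Subset model: P(has_job=1 | sex) directly.
print(df.groupby("sex")["has_job"].mean())  # f: 0.33, m: 0.33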






answered Mar 22 at 17:42 by Esmailian, edited Mar 23 at 0:08












  • Thank you for the answer! I understand that the two are the same when you look at counts of the events. They will be the same no matter in what order you perform the aggregation. I'm talking about probabilities that stem from a machine learning model (say some neural net etc.). – oW_, Mar 22 at 23:11

  • @oW_ My pleasure. A neural network is trying to estimate P(e|~), so the final result would be the same. The shortcoming of a NN in modelling P(e|~) is its error, and I think there is not much to say about the error, either mathematically or by example. – Esmailian, Mar 22 at 23:19

  • You seem to argue that the correct probabilities are based on counts, whereas a model would simply approximate them. Then there would be no need for machine learning at all. An ML model provides better generalization and should provide more robust probabilities (e.g. in the case where you have many sparse categories, because it can borrow strength from related categories). My question is whether or not on average they are also more accurate, or if that higher accuracy of the full model over the subset model gets lost when aggregating. Hope this helps. – oW_, Mar 22 at 23:35

  • @oW_ I think the generalization power of an ML model is relevant to "how accurately does the model estimate P(e|~)?", meaning: better generalization, better estimation, less error. We can compare the two approaches by assuming perfect generalization (i.e. no error); otherwise we cannot arrive at a conclusion or give an example. – Esmailian, Mar 22 at 23:46









