Is it bad to have many different measurements for the same target variable?
I'm working on a dataset that has repeated measurements for the same target variable.
When I train a model on the data as-is, cross-validation gives a score of 0.99, but on the test set the score is only around 0.39, so the model appears to overfit.
When I instead summarize each measurement with its mean, std, skew, and quartiles, so that each id has a single row, the score is much better.
Can anyone explain why? And when is it appropriate to use the second method?
The original dataset looks like this (all numbers are fake):
id  | measurement1 | measurement2 | ... | target
0-1 | 0.18283      | 0.12855      | ... | 1
0-2 | 0.1141       | 0.38484      | ... | 1
0-3 | 0.4475       | 0.18374      | ... | 1
and the transformed dataset looks like this:
id | meas1_avg | meas1_std | meas1_skew | meas2_avg | meas2_std | ... | target
0  | 0.28747   | 0.183848  | 0.198384   | 0.18484   | 0.28474   | ... | 1
Tags: machine-learning, feature-engineering, data-science-model
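The transformation described in the question (one row of summary statistics per id) could be sketched with pandas roughly as follows. This is only an illustration under assumptions: the column names, the rule that the part of the id before the dash identifies the group, and all numeric values (including the second group) are made up here and are not taken from the asker's real data.

```python
import pandas as pd

# Made-up long-format data: several measurement rows per group (all values are fake).
raw = pd.DataFrame({
    "id": ["0-1", "0-2", "0-3", "1-1", "1-2", "1-3"],
    "measurement1": [0.18283, 0.1141, 0.4475, 0.52, 0.61, 0.48],
    "measurement2": [0.12855, 0.38484, 0.18374, 0.71, 0.66, 0.80],
    "target": [1, 1, 1, 0, 0, 0],
})

# Assumption: the prefix before the dash ("0" in "0-1") is the group key.
raw["group"] = raw["id"].str.split("-").str[0]


def q25(s):
    """First quartile of a Series."""
    return s.quantile(0.25)


def q75(s):
    """Third quartile of a Series."""
    return s.quantile(0.75)


# One row per group: summary statistics of every measurement column.
features = raw.groupby("group")[["measurement1", "measurement2"]].agg(
    ["mean", "std", "skew", q25, "median", q75]
)

# Flatten the (column, statistic) MultiIndex into names like "measurement1_mean".
features.columns = [f"{col}_{stat}" for col, stat in features.columns]

# The target is constant within a group, so take its first value.
target = raw.groupby("group")["target"].first()
transformed = features.join(target)

print(transformed)
```

Each group contributes exactly one row to `transformed`, so a model trained on it sees one example per id rather than one example per individual measurement.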
asked Mar 26 at 14:58 by edunlimit
1 Answer
Note that you are solving two different problems here.
In the first problem, you want to predict the target variable from a single noisy measurement.
In the second problem, you want to predict the target variable from summary statistics of a group of noisy measurements.
Your results show that the second problem is easier to solve, which is intuitive: the average of multiple measurements has less noise (variance) than a single measurement (this is closely related to the Law of Large Numbers), so the relationship in the second problem is easier for the model to find.
Therefore, if both problems are equivalent for your purposes, go with the second one, which is easier to solve.
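The variance argument above can be checked with a small simulation. This is a sketch under assumptions that are not in the original post: the noise is independent Gaussian with the same standard deviation `sigma` for every measurement, and the values of `true_value`, `sigma`, `n_groups`, and `n_repeats` are arbitrary. Under those assumptions the variance of the mean of n measurements is sigma^2 / n, so averaging 10 repeats should cut the variance by about a factor of 10.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 1.0   # hypothetical noise-free quantity being measured
sigma = 0.5        # hypothetical noise standard deviation
n_groups = 20000   # number of simulated ids
n_repeats = 10     # repeated measurements per id

# Each row holds the repeated noisy measurements for one id.
measurements = true_value + sigma * rng.standard_normal((n_groups, n_repeats))

# Variance when only a single measurement per id is used.
single_var = measurements[:, 0].var()

# Variance when the repeats are averaged per id.
mean_var = measurements.mean(axis=1).var()

print(f"variance of a single measurement: {single_var:.4f}")
print(f"variance of the per-id mean:      {mean_var:.4f}")
print(f"theoretical sigma^2 / n:          {sigma**2 / n_repeats:.4f}")
```

If the measurements within an id are correlated rather than independent, the reduction is smaller than sigma^2 / n, so the factor seen on real data may differ.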
answered Mar 26 at 15:06 by Esmailian (edited Mar 26 at 18:09)