Is it bad to have many different measurements for the same target variable?

I'm working on a dataset that has repeated measurements for the same target variable.

When I train a model on the data as-is, cross-validation gives a score of 0.99, but on the test set the score is only around 0.39, so the model is clearly overfitting.

When I instead summarize each measurement with its mean, standard deviation, skew, and quartiles, so that each id has only one row, I get a much better score.

Can anyone explain why? And when is it a good idea to use the second method?

The original dataset looks like this (all numbers are fake):

id  / measurement1 / measurement2 / ... / target /
0-1 / 0.18283      / 0.12855      / ... / 1      /
0-2 / 0.1141       / 0.38484      / ... / 1      /
0-3 / 0.4475       / 0.18374      / ... / 1      /

and the transformed dataset looks like this:

id / meas1_avg / meas1_std / meas1_skew / meas2_avg / meas2_std / ... / target /
0  / 0.28747   / 0.183848  / 0.198384   / 0.18484   / 0.28474   / ... / 1      /
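The per-id aggregation described in the question could be sketched with pandas roughly as follows. This is only an illustration under assumptions: the column names follow the fake table above, the second group of rows is made up for the example, and the `subject` column (taken from the part of `id` before the dash) is a hypothetical helper, not something confirmed by the original dataset.

```python
import pandas as pd

# Illustrative data only: several rows (repeated measurements) per subject.
# The "0-*" rows mirror the fake table in the question; the "1-*" rows are made up.
raw = pd.DataFrame({
    "id": ["0-1", "0-2", "0-3", "1-1", "1-2", "1-3"],
    "measurement1": [0.18283, 0.1141, 0.4475, 0.52, 0.61, 0.47],
    "measurement2": [0.12855, 0.38484, 0.18374, 0.90, 0.85, 0.95],
    "target": [1, 1, 1, 0, 0, 0],
})

# Assumption: the part of the id before the dash identifies the subject.
raw["subject"] = raw["id"].str.split("-").str[0]

feature_cols = ["measurement1", "measurement2"]


def q25(s):
    """First quartile of a group of measurements."""
    return s.quantile(0.25)


def q75(s):
    """Third quartile of a group of measurements."""
    return s.quantile(0.75)


# One row per subject, one column per (measurement, statistic) pair.
features = raw.groupby("subject")[feature_cols].agg(
    ["mean", "std", "skew", q25, "median", q75]
)
# Flatten the (column, statistic) MultiIndex into names like "measurement1_mean".
features.columns = [f"{col}_{stat}" for col, stat in features.columns]

# Every row of a subject carries the same target here, so take the first one.
target = raw.groupby("subject")["target"].first()
aggregated = features.join(target)
print(aggregated)
```

The result has one row per subject, which is the shape of the transformed dataset shown in the question.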
















      machine-learning feature-engineering data-science-model






asked Mar 26 at 14:58 by edunlimit

          1 Answer

Note that you are solving two different problems here.

In the first problem, you want to predict the target variable given a single noisy measurement.

In the second problem, you want to predict the target variable given summary statistics computed from a group of noisy measurements.

Your results show that the second problem is easier to solve, which is intuitive: the average of several measurements has less noise (lower variance) than any single measurement (this is closely related to the Law of Large Numbers), so the relationship between the features and the target is easier for the model to find.

Therefore, if both problems are equivalent for your purposes, go with the second one, since it is easier to solve.
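The variance argument can be checked numerically. For n independent measurements that each have noise variance σ², their mean has variance σ²/n. Below is a minimal simulation sketch, assuming independent Gaussian noise with σ = 1; the function name and its parameters are illustrative and not part of the original answer.

```python
import random
import statistics


def variance_of_group_means(n_per_group, n_groups=20_000, sigma=1.0, seed=0):
    """Empirical variance of the mean of `n_per_group` noisy measurements.

    Assumes independent zero-mean Gaussian noise with standard deviation `sigma`.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_groups):
        group = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        means.append(statistics.fmean(group))
    return statistics.pvariance(means)


single = variance_of_group_means(1)     # one measurement per prediction
averaged = variance_of_group_means(10)  # mean of 10 measurements per prediction

print(f"variance of a single measurement: {single:.3f}")
print(f"variance of a mean of 10:         {averaged:.3f}")
```

Under these assumptions the second number should come out close to one tenth of the first. Real repeated measurements may be correlated, in which case the reduction is smaller.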













answered Mar 26 at 15:06 by Esmailian (edited Mar 26 at 18:09)