Is it bad to have many different measurements for the same target variable?

$begingroup$


I'm working on a dataset that has repeated measurements for the same target variable.



When I train a model on the raw rows without changing anything, cross-validation gives a score of 0.99, but on the test set the score is only around 0.39, so the model is clearly overfitting.



When I instead summarize the repeated measurements of each feature with the mean, std, skew and quartiles, so that each id has a single row, the model gives a much better score.



Can anyone explain why? And when is it a good idea to use the second method?



The original dataset looks like this (all numbers are fake):

id  / measurement1 / measurement2 / ... / target /
0-1 / 0.18283      / 0.12855      / ... / 1      /
0-2 / 0.1141       / 0.38484      / ... / 1      /
0-3 / 0.4475       / 0.18374      / ... / 1      /


The transformed dataset looks like this:

id / meas1_avg / meas1_std / meas1_skew / meas2_avg / meas2_std / ... / target /
0  / 0.28747   / 0.183848  / 0.198384   / 0.18484   / 0.28474   / ... / 1      /
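For illustration, here is a minimal sketch of how a table like the transformed one could be built with pandas. It is not the asker's actual code. It assumes that the part of the id before the "-" identifies the group of repeated measurements and that the target is constant within a group. The data values and the `group` column are made up for the example, and the generated column names (`measurement1_mean`, ...) differ slightly from the `meas1_avg` style shown above.

```python
import pandas as pd

# Hypothetical raw data shaped like the original table (fake numbers).
raw = pd.DataFrame({
    "id": ["0-1", "0-2", "0-3", "1-1", "1-2", "1-3"],
    "measurement1": [0.18283, 0.1141, 0.4475, 0.9, 1.1, 1.0],
    "measurement2": [0.12855, 0.38484, 0.18374, 0.5, 0.7, 0.6],
    "target": [1, 1, 1, 0, 0, 0],
})

# Assumption: the prefix before "-" identifies the repeated-measurement group.
raw["group"] = raw["id"].str.split("-").str[0]

# One row per group, with summary statistics for every measurement column.
features = raw.groupby("group")[["measurement1", "measurement2"]].agg(
    ["mean", "std", "skew"]
)

# Flatten the two-level column index into names like "measurement1_mean".
features.columns = [f"{col}_{stat}" for col, stat in features.columns]

# Assumption: the target is identical for all rows of a group.
target = raw.groupby("group")["target"].first()

transformed = features.join(target).reset_index()
print(transformed)
```

Quartiles could be added in the same way, for example with a small helper passed to `agg` that calls `Series.quantile`.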


















$endgroup$
















      machine-learning feature-engineering data-science-model
















      asked Mar 26 at 14:58 by edunlimit




















          1 Answer












          $begingroup$

          Note that you are solving two different problems here.



          In the first problem, you want to predict the target variable given one noisy measurement.



          In the second problem, you want to predict the target variable given some statistics from a group of noisy measurements.



          Your results show that the second problem is easier to solve, which is intuitive: the average of multiple measurements has less noise (variance) than a single measurement (this is closely related to the Law of Large Numbers), so the relation in the second problem is easier for the model to find.



          Therefore, if both problems are equivalent for your purposes, go with the second one, which is easier to solve.
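The variance argument above can be checked with a small simulation. This is only a sketch under stated assumptions: the noise is independent Gaussian noise with the same standard deviation for every measurement, and `true_value`, `sigma`, `n_repeats` and `n_groups` are hypothetical numbers chosen for the example. Under these assumptions the variance of the mean of n measurements is sigma^2 / n.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 1.0   # hypothetical noise-free quantity behind the measurements
sigma = 0.5        # assumed noise standard deviation of a single measurement
n_repeats = 10     # repeated measurements per group
n_groups = 100_000 # number of simulated groups

# Each row is one group of repeated noisy measurements of the same quantity.
noisy = true_value + sigma * rng.standard_normal((n_groups, n_repeats))

# Variance when only one measurement per group is used.
var_single = noisy[:, 0].var()

# Variance when the repeated measurements are averaged first.
var_mean = noisy.mean(axis=1).var()

print("single measurement variance:", var_single)   # close to sigma**2
print("averaged measurement variance:", var_mean)   # close to sigma**2 / n_repeats
```

If the repeated measurements in real data are correlated, the reduction will be smaller than sigma^2 / n, so this should be read as an upper bound on the benefit of averaging.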

















          $endgroup$













                answered Mar 26 at 15:06 by Esmailian (edited Mar 26 at 18:09)


























