Machine Learning, Imputing values that should be blank























Sometimes data sets contain variables that indicate both whether an event occurred and the value associated with that event.

As an example, say a teacher wants to predict the grades of his students. Some of the students may have been in his class last year, and he can use that grade as a variable. However, perhaps only 20% of the students were in his class, so the remaining 80% will have a null value. Most ML algorithms cannot accept null values, so the variable would have to be imputed somehow.

I cannot think of an imputation method that would make sense here. The standard mean/mode imputation would imply that all students were in the class, and since the variable is heavily unbalanced, with 80% of the values imputed, I don't imagine it would hold any valuable information.

Are there any methods to deal with this scenario?























      machine-learning python feature-selection data-imputation
















asked Mar 25 at 17:08 by Mustard Tiger




















          3 Answers




































































It seems that you are dealing with sparse data. Imputation is difficult, and an imputation attempt often makes only a trivial difference. You may want to look at this link for some approaches, such as Ghahramani and Jordan's.

These are variants of SVM that focus on sparse data.

















answered Mar 25 at 17:50 by T3J45
























For the specific case of grades, you could convert the variable into categories, with the null values assigned a category of their own.
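As a rough illustration of the category idea, the sketch below bins last year's grade and gives students without a grade their own level. The cut-offs (80 and 60), the category names, and the sample values are assumptions made up for the example, not anything from the question.

```python
from typing import List, Optional


def grade_category(grade: Optional[float]) -> str:
    """Map last year's grade to a category; a missing grade gets its own level."""
    if grade is None:
        return "not_enrolled"  # hypothetical label for "was not in the class"
    if grade >= 80:            # assumed cut-off
        return "high"
    if grade >= 60:            # assumed cut-off
        return "medium"
    return "low"


# Hypothetical data: None means the student was not in last year's class.
last_year: List[Optional[float]] = [91.0, None, 72.5, None, 48.0]
categories = [grade_category(g) for g in last_year]
print(categories)
```

The resulting categorical column can then be one-hot encoded like any other, so the model sees "not enrolled" as a distinct state rather than as a fake grade.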



Another option would be to impute with the mean or the median, but first create a binary variable that flags the null values.
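A minimal sketch of this second option, assuming the grades arrive as a plain list with `None` for students who were not in the class. The function name `impute_with_indicator` and the sample values are hypothetical.

```python
from statistics import median
from typing import List, Optional, Tuple


def impute_with_indicator(
    values: List[Optional[float]],
) -> Tuple[List[float], List[int]]:
    """Return (imputed values, missing flags) using the median of observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = median(observed)
    # The flag is built from the original values, before any filling happens.
    was_missing = [1 if v is None else 0 for v in values]
    imputed = [fill if v is None else v for v in values]
    return imputed, was_missing


# Hypothetical data: None means the student was not in last year's class.
last_year: List[Optional[float]] = [91.0, None, 72.5, None, 48.0]
grade_filled, grade_missing = impute_with_indicator(last_year)
print(grade_filled)
print(grade_missing)
```

Both columns go into the model together, so it can learn a separate effect for "no grade last year" instead of treating the filled value as real. In practice the fill value should be computed on the training split only. If scikit-learn is available, `SimpleImputer(strategy="median", add_indicator=True)` does much the same thing.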

















answered Mar 25 at 23:29 by Victor Villacorta
























Since last year's grade should be an important feature, we should use it whenever it is available.

I think a stratified model should work here. Create two different models: one for last year's students and one for the remaining 80%. The model for the 20% sample may have some additional features; all the other features will be common to both models.
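The two-model idea could be sketched roughly as follows. To keep it short, each model is a one-feature least-squares line, which is a simplification: a real model for returning students would use the shared features plus last year's grade. The `hours_studied` feature, the training rows, and all function names are assumptions invented for the example.

```python
from statistics import mean
from typing import Callable, List, Optional, Tuple

# Hypothetical row layout: (hours_studied, last_year_grade or None, current_grade)
Row = Tuple[float, Optional[float], float]


def fit_line(xs: List[float], ys: List[float]) -> Callable[[float], float]:
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("feature is constant; cannot fit a slope")
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


train: List[Row] = [
    (3.0, 60.0, 65.0), (5.0, 70.0, 75.0), (4.0, 80.0, 85.0),  # returning students
    (2.0, None, 50.0), (4.0, None, 60.0), (6.0, None, 70.0),  # new students
]

# Split the training data by whether last year's grade exists.
returning = [r for r in train if r[1] is not None]
new = [r for r in train if r[1] is None]

# Returning students: predict from last year's grade.
model_returning = fit_line([r[1] for r in returning], [r[2] for r in returning])
# New students: predict from a feature every student has.
model_new = fit_line([r[0] for r in new], [r[2] for r in new])


def predict(hours_studied: float, last_year_grade: Optional[float]) -> float:
    """Route each student to the model trained on their stratum."""
    if last_year_grade is not None:
        return model_returning(last_year_grade)
    return model_new(hours_studied)


print(predict(1.0, 90.0))  # returning student
print(predict(3.0, None))  # new student
```

With this split no imputation is needed at all. The trade-off is that the model for returning students is trained on only about 20% of the data.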

















answered Mar 25 at 21:19 by Matteo Felici


























