What to do if the missing data in one column is based on some value/condition in another column in R?






I have a dataset with 20,000 observations and 19 variables. It has a gender column with three levels, 'M', 'F' and 'U', where 'U' can be read as "not disclosed". Whenever gender is 'U', two of the other columns, Age and Tenure, are NA. This can be interpreted as follows: a person who is not willing to disclose their gender is also not willing to disclose their age and tenure. How do I deal with such a situation? Apart from these three columns, the other 16 columns in the dataset contain meaningful data. Would a standard imputation technique such as KNN imputation help in this case?



Here is my reproducible example that I have tried my best with:



    x <- data.frame(
      gender  = c('M', 'M', 'F', 'F', 'U', 'F', 'M', 'U'),
      age     = c(21, 24, 20, 34, NA, 40, 56, NA),
      tenure  = c(7, 4, 5, 3, NA, 2, 4, NA),
      job     = c('Doctor', 'IT', 'Banking', 'Truck Driver',
                  'Finance', 'Agriculture', 'Electrician', 'Teacher'),
      country = c('Australia', 'America', 'New Zealand', 'Sweden',
                  'England', 'France', 'Denmark', 'Norway')
    )


The data frame:

      gender age tenure          job     country
    1      M  21      7       Doctor   Australia
    2      M  24      4           IT     America
    3      F  20      5      Banking New Zealand
    4      F  34      3 Truck Driver      Sweden
    5      U  NA     NA      Finance     England
    6      F  40      2  Agriculture      France
    7      M  56      4  Electrician     Denmark
    8      U  NA     NA      Teacher      Norway


As the example above shows, whenever the gender is undefined ('U'), both age and tenure are missing, and this holds across the entire dataset. What would be the best way to deal with such a situation? Is this what is called Missing at Random data? Any suggestions would be extremely helpful. Thanks a lot.
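Before choosing a strategy, it may help to confirm that the pattern really holds in every row. The following is a minimal base-R sketch on the toy data from the question (only the three relevant columns are reproduced); the variable names `age_by_gender`, `pattern_holds` and `na_counts` are introduced here for illustration.

```r
# Toy data from the question (only the three relevant columns)
x <- data.frame(
  gender = c('M', 'M', 'F', 'F', 'U', 'F', 'M', 'U'),
  age    = c(21, 24, 20, 34, NA, 40, 56, NA),
  tenure = c(7, 4, 5, 3, NA, 2, 4, NA),
  stringsAsFactors = FALSE
)

# Cross-tabulate gender against missingness of age
age_by_gender <- table(gender = x$gender, age_missing = is.na(x$age))
print(age_by_gender)

# TRUE only if age and tenure are NA in exactly the rows where gender is 'U'
pattern_holds <- all(is.na(x$age) == (x$gender == 'U')) &&
  all(is.na(x$tenure) == (x$gender == 'U'))
print(pattern_holds)

# Number of missing values per column
na_counts <- colSums(is.na(x))
print(na_counts)
```

On the full dataset the same check would show whether any 'M' or 'F' rows also have missing age or tenure, which would call for separate handling.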







machine-learning r data-mining missing-data data-imputation
















asked Sep 11 '18 at 16:18









The Asipiring one











  • This is Missing Not At Random. – user2974951, Dec 11 '18 at 8:31




























2 Answers



You have more or less answered your own question: if you look closely, you already know what the gender value 'U' means in your dataset. Why not train two classifiers, one for the rows where gender is 'U' and one for the rest, and then choose the classifier according to whether the user's gender is 'U' or not?



Also, can you elaborate on the goal of the model you're building? That might change the approach to the problem.
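The two-classifier idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the data are simulated, the predictor `amount` and the outcome `cancelled` are hypothetical stand-ins for the real columns, and logistic regression via `glm` is used just because it ships with base R. The helper `predict_cancel` is likewise introduced here.

```r
set.seed(1)
n <- 400

# Simulated stand-in for the real data; `amount` and `cancelled` are hypothetical
gender <- sample(c('M', 'F', 'U'), n, replace = TRUE, prob = c(0.4, 0.4, 0.2))
is_u   <- gender == 'U'
d <- data.frame(
  gender    = gender,
  age       = ifelse(is_u, NA, sample(18:65, n, replace = TRUE)),
  tenure    = ifelse(is_u, NA, sample(1:10, n, replace = TRUE)),
  amount    = runif(n, 10, 500),
  cancelled = rbinom(n, 1, 0.3),
  stringsAsFactors = FALSE
)

# Model 1: rows with disclosed gender, where age and tenure are available
fit_known <- glm(cancelled ~ gender + age + tenure + amount,
                 data = d[!is_u, ], family = binomial)

# Model 2: 'U' rows, using only the columns that are actually observed
fit_u <- glm(cancelled ~ amount, data = d[is_u, ], family = binomial)

# Route each row to the model that matches its gender value
predict_cancel <- function(newdata) {
  u <- newdata$gender == 'U'
  p <- numeric(nrow(newdata))
  if (any(!u)) {
    p[!u] <- predict(fit_known, newdata[!u, , drop = FALSE], type = "response")
  }
  if (any(u)) {
    p[u] <- predict(fit_u, newdata[u, , drop = FALSE], type = "response")
  }
  p
}

p <- predict_cancel(d)
```

The point of the split is that neither model ever sees an NA: the first is trained only where age and tenure exist, the second never uses them.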







edited Sep 11 '18 at 17:35
answered Sep 11 '18 at 17:27
Felipe Bormann












  • It is a dataset that contains details about a list of transactions and the associated user profiles, and I am trying to build a model to predict whether a transaction was successful or cancelled. What do you mean by a different classifier? Could you please elaborate a bit on that? I would like to explore your idea. Thanks. – The Asipiring one, Sep 12 '18 at 4:21



















If you plan to use LightGBM or XGBoost, the advice is "do not do anything". These methods handle NA natively: at each split in each decision tree they choose which branch the missing values should follow, and the results are usually better than those obtained with imputation.
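As a rough illustration of this "do nothing" route, the sketch below passes the NA values straight to XGBoost. It assumes the `xgboost` R package is installed; the data are simulated, the columns `amount` and `cancelled` are hypothetical stand-ins for the real ones, and the parameter values are arbitrary rather than tuned.

```r
library(xgboost)

set.seed(42)
n <- 200

# Simulated stand-in for the real data: age and tenure are NA together
undisclosed <- runif(n) < 0.2
age       <- ifelse(undisclosed, NA, sample(18:65, n, replace = TRUE))
tenure    <- ifelse(undisclosed, NA, sample(1:10, n, replace = TRUE))
amount    <- runif(n, 10, 500)                           # hypothetical predictor
cancelled <- rbinom(n, 1, ifelse(undisclosed, 0.6, 0.3)) # hypothetical outcome

# Numeric feature matrix with the NAs left in place (no imputation)
X <- cbind(age = age, tenure = tenure, amount = amount)

# `missing = NA` tells XGBoost which value marks a missing entry
dtrain <- xgb.DMatrix(data = X, label = cancelled, missing = NA)

fit <- xgb.train(
  params  = list(objective = "binary:logistic", max_depth = 3, eta = 0.3),
  data    = dtrain,
  nrounds = 20
)

# Predicted cancellation probabilities, again without imputing anything
pred <- predict(fit, xgb.DMatrix(data = X, missing = NA))
```

Categorical columns such as gender, job and country would still need a numeric encoding before they can go into the matrix; that step is left out here.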







edited Apr 10 at 8:30
answered Apr 10 at 8:24
Grzegorz Sionkowski
