What to do if the missing data in one column is based on some value/condition in another column in R?


I have a dataset with 20,000 observations and 19 variables. It has a gender column with three levels, 'M', 'F' and 'U', where 'U' means not disclosed. Whenever gender is 'U', two other columns, Age and Tenure, are NA. This could be interpreted as follows: a person who does not disclose their gender also does not disclose their age and tenure. How do I deal with such a situation? The other 16 columns in the dataset contain meaningful data. Would a standard imputation technique such as KNN imputation help in this case?

Here is my reproducible example that I have tried my best with:

x <- data.frame(
  gender  = c('M', 'M', 'F', 'F', 'U', 'F', 'M', 'U'),
  age     = c(21, 24, 20, 34, NA, 40, 56, NA),
  tenure  = c(7, 4, 5, 3, NA, 2, 4, NA),
  job     = c('Doctor', 'IT', 'Banking', 'Truck Driver', 'Finance',
              'Agriculture', 'Electrician', 'Teacher'),
  country = c('Australia', 'America', 'New Zealand', 'Sweden', 'England',
              'France', 'Denmark', 'Norway')
)

The Dataframe:

  gender age tenure          job     country
1      M  21      7       Doctor   Australia
2      M  24      4           IT     America
3      F  20      5      Banking New Zealand
4      F  34      3 Truck Driver      Sweden
5      U  NA     NA      Finance     England
6      F  40      2  Agriculture      France
7      M  56      4  Electrician     Denmark
8      U  NA     NA      Teacher      Norway

As the example shows, whenever gender is undisclosed, both age and tenure are missing, and this holds across the entire dataset. What would be the best way to deal with this? Also, is this what is called Missing at Random data? Any suggestions would be extremely helpful. Thanks a lot.
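Before choosing a strategy, it may help to confirm that the missingness really lines up with gender 'U' on every row. The following is a minimal base R sketch on the example data frame from the question; the name `same_pattern` is only illustrative, and on the full dataset you would run the same checks on your own data frame.

```r
# Example data from the question (stringsAsFactors = FALSE keeps gender as character)
x <- data.frame(
  gender  = c('M', 'M', 'F', 'F', 'U', 'F', 'M', 'U'),
  age     = c(21, 24, 20, 34, NA, 40, 56, NA),
  tenure  = c(7, 4, 5, 3, NA, 2, 4, NA),
  stringsAsFactors = FALSE
)

# Cross-tabulate gender against whether age is missing
print(table(gender = x$gender, age_missing = is.na(x$age)))

# TRUE only if age and tenure are NA exactly on the rows where gender is 'U'
same_pattern <- all(is.na(x$age) == (x$gender == "U")) &&
  all(is.na(x$tenure) == (x$gender == "U"))
print(same_pattern)
```

If `same_pattern` is TRUE, the 'U' level already carries all the information that a separate "age is missing" indicator would add.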

asked Sep 11 '18 at 16:18

The Asipiring one

211

This is Missing Not At Random.
– user2974951
Dec 11 '18 at 8:31

machine-learning r data-mining missing-data data-imputation


2 Answers

You have more or less answered your own question: if you look closely, you already know what the gender value 'U' implies in your dataset. Why not train a separate classifier for gender 'U' (two classifiers in total), and then pick the classifier according to whether the user's gender is 'U' or not?

Also, can you elaborate more on the goal of the model you're building? That might change the approach to the problem.
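One way the two-classifier idea could look in R is sketched below. This is only an illustration under assumptions: the data are simulated, the always-observed feature `amount` and the target `success` are hypothetical names that do not come from the question, and logistic regression via `glm` stands in for whatever classifier you actually prefer.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the real data: age and tenure are NA whenever gender is 'U'
gender  <- sample(c("M", "F", "U"), n, replace = TRUE, prob = c(0.45, 0.45, 0.10))
age     <- ifelse(gender == "U", NA_real_, round(runif(n, 18, 65)))
tenure  <- ifelse(gender == "U", NA_real_, round(runif(n, 0, 10)))
amount  <- runif(n, 10, 500)    # hypothetical feature that is always observed
success <- rbinom(n, 1, 0.6)    # hypothetical target: 1 = successful, 0 = cancelled
d <- data.frame(gender, age, tenure, amount, success, stringsAsFactors = FALSE)

is_u <- d$gender == "U"

# Model 1: rows with disclosed gender, using all features
fit_known <- glm(success ~ gender + age + tenure + amount,
                 data = d[!is_u, ], family = binomial)

# Model 2: rows with gender 'U', using only the features that are observed there
fit_u <- glm(success ~ amount, data = d[is_u, ], family = binomial)

# Route each row to the model that matches its gender value
predict_success <- function(newdata) {
  u <- newdata$gender == "U"
  p <- numeric(nrow(newdata))
  if (any(!u)) p[!u] <- predict(fit_known, newdata[!u, , drop = FALSE], type = "response")
  if (any(u))  p[u]  <- predict(fit_u, newdata[u, , drop = FALSE], type = "response")
  p
}

p <- predict_success(d)
```

With this setup no imputation is needed, because each model only ever sees columns that are fully observed for its rows. The cost is that the 'U' model is trained on far fewer rows.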

edited Sep 11 '18 at 17:35

answered Sep 11 '18 at 17:27

Felipe Bormann

36117

It is a dataset that contains details about a list of transactions and the associated user profiles, and I am trying to build a model to predict whether a transaction was successful or cancelled. What do you mean by a different classifier? Could you please elaborate on that? I would like to explore your idea. Thanks.
– The Asipiring one
Sep 12 '18 at 4:21

If you plan to use LightGBM or XGBoost, the advice is "do not do anything". These methods handle NA natively: at each split in each tree, they learn which branch the rows with missing values should follow, and the results are usually better than those obtained with imputation.
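As a rough sketch of what "not doing anything" could look like with the xgboost R package, the example below passes the NA values in age and tenure straight to the model. The label vector `success` is made up for illustration, and the parameter values are arbitrary rather than recommendations.

```r
library(xgboost)

# Example data from the question
x <- data.frame(
  gender = c('M', 'M', 'F', 'F', 'U', 'F', 'M', 'U'),
  age    = c(21, 24, 20, 34, NA, 40, 56, NA),
  tenure = c(7, 4, 5, 3, NA, 2, 4, NA),
  stringsAsFactors = FALSE
)

# Hypothetical target (not part of the question): 1 = successful, 0 = cancelled
success <- c(1, 0, 1, 1, 0, 1, 0, 1)

# Numeric feature matrix: gender as 0/1 indicator columns, NA kept as-is in age and tenure
features <- cbind(
  gender_M = as.numeric(x$gender == "M"),
  gender_F = as.numeric(x$gender == "F"),
  gender_U = as.numeric(x$gender == "U"),
  age      = x$age,
  tenure   = x$tenure
)

# xgb.DMatrix treats NA as missing by default, so no imputation step is needed
dtrain <- xgb.DMatrix(data = features, label = success)

fit <- xgb.train(
  params  = list(objective = "binary:logistic", max_depth = 2, eta = 0.3),
  data    = dtrain,
  nrounds = 10
)

pred <- predict(fit, dtrain)  # predicted probability of success per row
```

Eight rows are far too few to learn anything meaningful; the point is only that training and prediction run with the NA values left in place.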

edited Apr 10 at 8:30

answered Apr 10 at 8:24

Grzegorz Sionkowski
