What to do if the missing data in one column is based on some value/condition in another column in R?
I have a dataset with 20,000 observations and 19 variables. It has a gender column with three levels, 'M', 'F' and 'U', where 'U' can be taken as not disclosed. Whenever gender is 'U', two other columns, Age and Tenure, are NA. One interpretation is that a person who is not willing to disclose their gender is also not willing to disclose their age and tenure. How do I deal with such a situation? Apart from these three columns, the other 16 columns in the dataset contain meaningful data. Would standard imputation techniques such as KNN imputation help in this case?
Here is my reproducible example:
x <- data.frame(
  gender  = c('M', 'M', 'F', 'F', 'U', 'F', 'M', 'U'),
  age     = c(21, 24, 20, 34, NA, 40, 56, NA),
  tenure  = c(7, 4, 5, 3, NA, 2, 4, NA),
  job     = c('Doctor', 'IT', 'Banking', 'Truck Driver', 'Finance', 'Agriculture', 'Electrician', 'Teacher'),
  country = c('Australia', 'America', 'New Zealand', 'Sweden', 'England', 'France', 'Denmark', 'Norway')
)
The data frame:
  gender age tenure          job     country
1      M  21      7       Doctor   Australia
2      M  24      4           IT     America
3      F  20      5      Banking New Zealand
4      F  34      3 Truck Driver      Sweden
5      U  NA     NA      Finance     England
6      F  40      2  Agriculture      France
7      M  56      4  Electrician     Denmark
8      U  NA     NA      Teacher      Norway
As you can see from the example above, whenever the gender is undefined, both age and tenure are missing, and this holds across the entire dataset. What would be the best way to deal with such a situation? And this is what is called Missing at Random data, is that right? Any suggestions would be extremely helpful. Thanks a lot.
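The pattern described above can be checked directly before choosing a strategy. The following is a minimal sketch on the toy data from the example; the names `tab_age` and `pattern_holds` are introduced here for illustration only.

```r
# Toy data from the example above (only the three relevant columns).
x <- data.frame(
  gender = c("M", "M", "F", "F", "U", "F", "M", "U"),
  age    = c(21, 24, 20, 34, NA, 40, 56, NA),
  tenure = c(7, 4, 5, 3, NA, 2, 4, NA)
)

# Cross-tabulate gender against missingness of age.
tab_age <- table(gender = x$gender, age_missing = is.na(x$age))
print(tab_age)

# TRUE only if age and tenure are NA in exactly the rows where gender is 'U'.
pattern_holds <- all(is.na(x$age) == (x$gender == "U")) &&
  all(is.na(x$tenure) == (x$gender == "U"))
print(pattern_holds)
```

Running the same check on the full 20,000 rows would confirm whether the rule really holds everywhere or only in most rows.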
machine-learning r data-mining missing-data data-imputation
asked Sep 11 '18 at 16:18
The Asipiring one
This is Missing Not At Random.
– user2974951
Dec 11 '18 at 8:31
2 Answers
You more or less answered your own question: if you look closely, you already know what the value 'U' in gender means in your dataset. Why not train a separate classifier for the gender 'U' rows (so two classifiers in total), and then choose which classifier to apply depending on whether the user's gender is 'U'?
Also, can you elaborate on the goal of the model you're building? That might change the approach to the problem.
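As a rough illustration of the two-classifier idea, here is a minimal sketch in base R. It assumes a binary outcome `success` and an extra numeric feature `amount`, both invented for this example, and uses logistic regression via `glm` only as a stand-in for whatever classifier is actually chosen.

```r
# Hypothetical data: `success` and `amount` are invented for illustration.
set.seed(1)
n <- 200
d <- data.frame(
  gender = sample(c("M", "F", "U"), n, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
  amount = runif(n, 10, 500),
  stringsAsFactors = FALSE
)
# Age and tenure are missing exactly where gender is 'U'.
d$age     <- ifelse(d$gender == "U", NA, round(runif(n, 18, 65)))
d$tenure  <- ifelse(d$gender == "U", NA, round(runif(n, 0, 10)))
d$success <- rbinom(n, 1, 0.6)

is_u <- d$gender == "U"

# Model for disclosed rows can use age and tenure; the model for 'U' rows cannot.
fit_known <- glm(success ~ gender + age + tenure + amount,
                 data = d[!is_u, ], family = binomial)
fit_u <- glm(success ~ amount, data = d[is_u, ], family = binomial)

# Route each row to the model that matches its gender value.
predict_success <- function(newdata) {
  u <- newdata$gender == "U"
  p <- numeric(nrow(newdata))
  if (any(!u)) p[!u] <- predict(fit_known, newdata[!u, ], type = "response")
  if (any(u)) p[u] <- predict(fit_u, newdata[u, ], type = "response")
  p
}

p <- predict_success(d)
```

Because no row passed to either model contains an NA in a feature that model uses, no imputation is needed in this setup.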
edited Sep 11 '18 at 17:35
answered Sep 11 '18 at 17:27
Felipe Bormann
It is a dataset that contains details about a list of transactions and the associated user profiles, and I am trying to build a model to predict whether a transaction was successful or cancelled. What do you mean by a different classifier? Would you be able to elaborate a bit on that, please? I would like to explore your idea. Thanks.
– The Asipiring one
Sep 12 '18 at 4:21
If you plan to use LightGBM or XGBoost, the advice is "do not do anything". These methods handle NA natively: during training each tree learns which branch missing values should follow at a split, so the handling can differ from tree to tree, and the results are usually better than with imputation.
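A minimal sketch of this with the `xgboost` R package follows, assuming the package is installed. The outcome `success` is invented for illustration, and the NA values in `age` and `tenure` are passed to the model unchanged.

```r
library(xgboost)

# Hypothetical data mirroring the question: NA in age and tenure when gender is 'U'.
set.seed(1)
n <- 200
gender  <- sample(c("M", "F", "U"), n, replace = TRUE, prob = c(0.4, 0.4, 0.2))
age     <- ifelse(gender == "U", NA, round(runif(n, 18, 65)))
tenure  <- ifelse(gender == "U", NA, round(runif(n, 0, 10)))
success <- rbinom(n, 1, 0.6)  # invented outcome

# xgboost needs a numeric matrix; NA entries are left as they are.
X <- cbind(
  gender_M = as.numeric(gender == "M"),
  gender_F = as.numeric(gender == "F"),
  age      = age,
  tenure   = tenure
)

# `missing = NA` tells xgboost which value marks a missing entry.
dtrain <- xgb.DMatrix(data = X, label = success, missing = NA)
bst <- xgb.train(
  params  = list(objective = "binary:logistic", max_depth = 2),
  data    = dtrain,
  nrounds = 10
)

# Rows with NA in age and tenure still receive a prediction.
p <- predict(bst, xgb.DMatrix(X, missing = NA))
```

Whether this beats imputation on a given dataset is worth checking with cross-validation rather than assuming it.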
edited Apr 10 at 8:30
answered Apr 10 at 8:24
Grzegorz Sionkowski