How to handle missing data in dependent variable?
I'm solving an ML problem where the dataset has around 40k records. A dependent variable is given in the problem statement, along with many independent variables. However, about 2k records have missing values in the dependent variable.
- How do I solve this problem? Should I exclude these records?
- Is imputing the missing values a good way to solve this? Wouldn't that give me inaccurate values, since the dependent variable depends on a number of other variables? I'm not sure.
Can anyone please help?
machine-learning regression data-cleaning linear-regression data-imputation
asked Nov 11 '18 at 1:59 by Aditya Kadrekar
Isn't the task to predict the dependent variable? If the dependent variable is missing, you simply can't use that data.
– Harsh, Nov 11 '18 at 2:48
I'm not sure if this is what you're asking for, but semi-supervised methods can use a large amount of unlabeled data together with a small amount of labeled data to build a classifier. You seem to have very few unlabeled data points, so you might not find this worth it.
– Harsh, Nov 11 '18 at 2:50
The task is to apply any ML model that I find apt for this problem statement. It could be literally anything: regression, classification or clustering.
– Aditya Kadrekar, Nov 11 '18 at 4:49
1 Answer
The questions you're asking are empirical. The only answer anyone can give is to try each option and see which works best.
You have three options:
- Impute the data
- Throw away the data
- Use a classifier that can handle missing data, e.g. xgboost. See this answer.
xgboost is a powerful classifier, so if you're not tuning very hard for performance, it is a great way to get a good v0.
Some other points:
- The pattern of missing values is important and can influence the choice of algorithm.
- If your dataset is noisy, imputing may only amplify the noise. If you can afford to drop those 2k rows, try that, or train both with and without those rows and see which performs better.
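One way to run the "with and without" comparison is sketched below, again on synthetic data with invented coefficients. It assumes mean imputation of the target and a plain linear regression, neither of which comes from the question. The test set is drawn only from rows with an observed target, so both variants are scored against real values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Hypothetical data: 1000 rows, 5% of targets treated as missing
# (the same proportion as 2k out of 40k).
X = rng.normal(size=(1000, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=1000)
missing = np.zeros(1000, dtype=bool)
missing[rng.choice(1000, size=50, replace=False)] = True

# Hold out a test set from the rows whose target is observed.
X_train, X_test, y_train, y_test = train_test_split(
    X[~missing], y[~missing], test_size=0.25, random_state=0
)

# Variant A: drop the rows with a missing target.
model_drop = LinearRegression().fit(X_train, y_train)

# Variant B: keep them, filling the target with the training mean.
X_aug = np.vstack([X_train, X[missing]])
y_aug = np.concatenate([y_train, np.full(missing.sum(), y_train.mean())])
model_imputed = LinearRegression().fit(X_aug, y_aug)

mse_drop = mean_squared_error(y_test, model_drop.predict(X_test))
mse_imputed = mean_squared_error(y_test, model_imputed.predict(X_test))
print(f"drop rows:   test MSE = {mse_drop:.4f}")
print(f"mean-impute: test MSE = {mse_imputed:.4f}")
```

On real data the same comparison is more reliable with cross-validation than with a single split.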
Regarding software, there are many options:
- scikit-learn has some imputation functions.
- MICE (Multiple Imputation by Chained Equations) works well when values are missing at random. It is available in fancyimpute and also in statsmodels.
You will find many resources if you search.
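For reference, a small sketch of two scikit-learn imputers applied to feature columns. `SimpleImputer` fills each column with a statistic such as the mean. `IterativeImputer` models each column from the others in a chained-equations style similar to MICE; it is still marked experimental, hence the extra import. The toy matrix is invented for the example, and fancyimpute and statsmodels are not shown here.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables the import below)
from sklearn.impute import IterativeImputer, SimpleImputer

# Toy feature matrix with one missing value in each column.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [np.nan, 8.0],
])

# Column-mean imputation: each NaN is replaced by the mean of its column.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Chained-equations-style imputation: each column is regressed on the other.
X_iter = IterativeImputer(random_state=0, max_iter=10).fit_transform(X)

print(X_mean)
print(X_iter)
```

In a real pipeline the imputer should be fitted on the training split only and then applied to the test split, so that no information leaks from the test data.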
answered Nov 11 '18 at 2:43 by Harsh
Thanks for your input! I'm going to try two methods: 1. Exclude the records with missing data, then split the remaining dataset into train and test sets. 2. Keep all the records with missing data in the test set, train the model, predict their values, and then compare with the imputed mean. Basically, I have been given a dataset with a target variable. The concern is that no metadata is given, so it's hard to understand which independent variables might influence the target variable. Year, month, day and hour are also given as columns.
– Aditya Kadrekar, Nov 11 '18 at 4:39