How to handle missing data in the dependent variable?

I'm working on an ML problem with a dataset of around 40k records. The problem statement specifies a dependent variable (and there are many independent variables), but about 2k records have a missing value for the dependent variable.
- How should I handle this? Should I exclude these records?
- Is imputing the missing values a good approach? Wouldn't that give inaccurate values, since the dependent variable depends on a number of other variables? I'm not sure.



Can anyone please help?







machine-learning regression data-cleaning linear-regression data-imputation
















asked Nov 11 '18 at 1:59









Aditya Kadrekar











Isn't the task to predict the dependent variable? If the dependent variable is missing, you simply can't use that data.
– Harsh, Nov 11 '18 at 2:48

I'm not sure if this is what you're asking for, but semi-supervised methods can use a large amount of unlabeled data together with a small amount of labeled data to build a classifier. You seem to have very few unlabeled data points, so you might not find this worth it.
– Harsh, Nov 11 '18 at 2:50
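
For readers curious what the semi-supervised route could look like, here is a minimal sketch, assuming a classification target and using scikit-learn's `SelfTrainingClassifier`. The toy data, the 30% unlabeled share, and the choice of `LogisticRegression` are all illustrative assumptions and do not come from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Toy classification data (hypothetical, not the asker's dataset).
X, y = make_classification(n_samples=400, n_features=5, random_state=0)

# Hide roughly 30% of the labels; scikit-learn marks "no label" with -1.
rng = np.random.default_rng(0)
unlabeled = rng.random(400) < 0.3
y_partial = y.copy()
y_partial[unlabeled] = -1

# Self-training: fit on the labeled rows, then iteratively add confident
# predictions for unlabeled rows as pseudo-labels and refit.
clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y_partial)

# Predicted classes for the rows whose label was hidden.
pred = clf.predict(X[unlabeled])
```

Whether this helps over simply dropping the unlabeled rows is something to check on held-out labeled data.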




The task is to apply any ML model that I find apt for this problem statement. It could be literally anything: regression, classification or clustering.
– Aditya Kadrekar, Nov 11 '18 at 4:49




1 Answer
These are empirical questions. The only reliable answer is to try the options and see which works best on your data.



You have three options:



  • Impute data


  • Throw away data


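  • Use a model that can handle missing values natively, e.g. xgboost. See this answer. xgboost is a powerful model, so if you're not tuning very hard for performance, it is a great way to get a good v0. Note that this covers missing values in the features; rows with a missing target still cannot be used to fit a supervised model.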


Some other points:



  • The pattern of missing values is important, and can influence the choice of algorithm.


  • If your dataset is noisy, imputing may only amplify the noise. If you can afford to drop those 2k rows, try that, or train both with and without them and see which version performs better.
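
One way to look at the pattern of missing values is to flag the rows with a missing target and compare them with the rest. The sketch below assumes a pandas DataFrame with an `hour` column (the asker mentions one in a comment) and a made-up feature `x1`; the missingness mechanism in the toy data is invented purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000

# Toy data (hypothetical): an hour-of-day column, one feature, one target.
df = pd.DataFrame({"hour": rng.integers(0, 24, size=n), "x1": rng.normal(size=n)})
df["y"] = df["x1"] + rng.normal(scale=0.5, size=n)

# Invented mechanism: the target only goes missing during the night hours.
night = df["hour"] < 6
df.loc[night & (rng.random(n) < 0.3), "y"] = np.nan

# Flag rows with a missing target.
df["y_missing"] = df["y"].isna()

# Share of missing targets per hour: a flat profile suggests missingness
# unrelated to time, while spikes suggest a systematic pattern.
missing_rate_by_hour = df.groupby("hour")["y_missing"].mean()

# Feature means for rows with and without a target.
feature_means = df.groupby("y_missing")["x1"].mean()
```

If the rows with a missing target look systematically different from the rest, dropping them changes what population the model is trained on, which is worth knowing before choosing an option.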


Regarding software, there are many options:



  • scikit-learn has some imputation functions.


  • MICE (Multiple Imputation by Chained Equations) works well when values are missing at random. It is available in fancyimpute, and also in statsmodels.
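
As a rough illustration of the scikit-learn route, the sketch below compares `SimpleImputer` (mean imputation) with `IterativeImputer`, a chained-equations style imputer that scikit-learn still ships behind an experimental import at the time of writing. It imputes a feature column rather than the target, and the toy data is an assumption made for the example.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(2)

# Toy data (hypothetical): column 2 depends on columns 0 and 1.
X = rng.normal(size=(300, 3))
X[:, 2] = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=300)

# Knock out about 20% of column 2 completely at random.
X_missing = X.copy()
X_missing[rng.random(300) < 0.2, 2] = np.nan
mask = np.isnan(X_missing[:, 2])

# Mean imputation ignores the other columns.
X_mean = SimpleImputer(strategy="mean").fit_transform(X_missing)

# Iterative imputation regresses each incomplete column on the others.
X_iter = IterativeImputer(random_state=0).fit_transform(X_missing)

# Mean absolute error of each imputation against the known true values.
err_mean = np.abs(X_mean[mask, 2] - X[mask, 2]).mean()
err_iter = np.abs(X_iter[mask, 2] - X[mask, 2]).mean()
```

On data like this, where the missing column is strongly related to the others, the iterative imputer should recover the values far better than the mean. On real data the gap depends on how predictable the column is.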


You will find many resources if you search.

















answered Nov 11 '18 at 2:43









Harsh











Thanks for your inputs! I'm going to try two methods: 1. Exclude the rows with missing data and then just split the remaining dataset into a train and a test set. 2. Keep all the rows with missing data in the test set, train the model, predict their values, and then compare with the imputed mean. Basically, I have been given a dataset with a target variable. The concern here is that no metadata is given, so it's hard to understand which independent variables might influence the target variable. Year, month, day and hour are also given as columns.
– Aditya Kadrekar, Nov 11 '18 at 4:39
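
The plan in this comment could be sketched as below. The feature names, the toy data, and the choice of `LinearRegression` are assumptions made for illustration. One caveat is built into the sketch: rows with a missing target can be predicted but not scored, so the comparison against the mean baseline is done on held-out rows whose target is known.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Toy data (hypothetical): three features, target missing in 25 rows.
features = ["x1", "x2", "x3"]
df = pd.DataFrame(rng.normal(size=(500, 3)), columns=features)
df["y"] = 2.0 * df["x1"] - df["x2"] + rng.normal(scale=0.2, size=500)
df.loc[df.index[:25], "y"] = np.nan

# Method 1: exclude rows with a missing target, then split train/test.
labeled = df.dropna(subset=["y"])
unlabeled = df[df["y"].isna()]
X_train, X_test, y_train, y_test = train_test_split(
    labeled[features], labeled["y"], test_size=0.2, random_state=0
)

# Fit the model and a baseline that always predicts the training mean.
model = LinearRegression().fit(X_train, y_train)
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)

# Score both on held-out rows with a known target.
mae_model = mean_absolute_error(y_test, model.predict(X_test))
mae_mean = mean_absolute_error(y_test, baseline.predict(X_test))

# Method 2: predict the target for the rows where it is missing.
predicted_missing = model.predict(unlabeled[features])
```

If `mae_model` is clearly below `mae_mean`, the model's predictions are a better fill-in for the missing targets than the mean would be.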




