
How to handle missing data in the dependent variable?


I'm working on an ML problem with a dataset of around 40k records. The problem statement specifies a dependent variable (there are many independent variables), but about 2k records have no value for that dependent variable.
- How should I handle this? Should I exclude these records?
- Is imputing the missing values a good approach? Wouldn't that give me inaccurate values, since the dependent variable depends on a number of other variables? I'm not sure.

Can anyone please help?

asked Nov 11 '18 at 1:59 by Aditya Kadrekar

Isn't the task to predict the dependent variable? If the dependent variable is missing, you simply can't use that data.
– Harsh, Nov 11 '18 at 2:48

I'm not sure if this is what you're asking for, but semi-supervised methods can combine a large amount of unlabeled data with a small amount of labeled data to train a classifier. You seem to have very few unlabeled data points, so you might not find this worth it.
– Harsh, Nov 11 '18 at 2:50

The task is to apply whichever ML model I find apt for this problem statement. It could be literally anything: regression, classification or clustering.
– Aditya Kadrekar, Nov 11 '18 at 4:49


machine-learning regression data-cleaning linear-regression data-imputation


1 Answer

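The questions you're asking are empirical. The only reliable answer is to try the options and see which works best.

You have three options:

1. Impute the missing values.
2. Throw away the rows with missing values.
3. Use a model that handles missing values natively, e.g. xgboost. See this answer. xgboost is a strong general-purpose learner, so if you're not tuning very hard for performance, it is a great way to get a good v0. Note that this native handling applies to missing values in the input features; as pointed out in the comments, rows whose dependent variable is missing cannot be used for supervised training.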

Some other points:

- The pattern of missing values matters and can influence the choice of algorithm.
- If your dataset is noisy, imputing may only amplify the noise. If you can afford to drop those 2k rows, try that, or train models both with and without that data and see whether the combination performs better.
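One rough way to look at the pattern of missing values is to compare the features of rows with and without a target. The sketch below assumes pandas. The data is synthetic, and the column names `hour`, `x` and `y` are hypothetical; the target is deliberately made to go missing more often late in the day.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "y" is missing more often when "hour" is large.
rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({"hour": rng.integers(0, 24, size=n), "x": rng.normal(size=n)})
df["y"] = df["x"] + rng.normal(scale=0.1, size=n)
drop_prob = np.where(df["hour"] >= 18, 0.3, 0.02)
df.loc[rng.random(n) < drop_prob, "y"] = np.nan

# Indicator for "target is missing", then compare feature means per group.
df["y_missing"] = df["y"].isna()
summary = df.groupby("y_missing")[["hour", "x"]].mean()

# Share of missing targets for each hour of the day.
missing_rate_by_hour = df.groupby("hour")["y_missing"].mean()
```

If the two groups in `summary` look clearly different, the values are probably not missing completely at random, and simply dropping those rows may bias the model towards the well-covered part of the data.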

Regarding software, there are many options:

- scikit-learn has some imputation functions.
- MICE (Multiple Imputation by Chained Equations) works well when values are missing at random. It is available in fancyimpute and also in statsmodels.

You will find many more resources if you search.
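As an illustration of the scikit-learn route, the sketch below compares plain mean imputation with `IterativeImputer`, a chained-equations style imputer. It is still marked experimental in scikit-learn, which is why the extra `enable_iterative_imputer` import is needed, and by default it returns a single imputation rather than multiple ones. The data is synthetic, and the third column is constructed to be predictable from the other two.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Hypothetical data: column 2 is almost a linear function of columns 0 and 1.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
X[:, 2] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.05, size=300)

# Hide roughly 20% of the values in column 2.
X_missing = X.copy()
mask = rng.random(300) < 0.2
X_missing[mask, 2] = np.nan

# Fill the gaps with the column mean and with the iterative imputer.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X_missing)
iter_filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(X_missing)

# Mean absolute error on the hidden entries (only known here because we hid them).
mean_err = np.abs(mean_filled[mask, 2] - X[mask, 2]).mean()
iter_err = np.abs(iter_filled[mask, 2] - X[mask, 2]).mean()
```

On data like this, where the missing column is strongly related to the others, the iterative imputer should recover the hidden values far better than the column mean. On real data the gap may be much smaller, so it is worth checking both.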

answered Nov 11 '18 at 2:43 by Harsh

Thanks for your inputs! I'm going to try two methods:
1. Exclude the records with missing data, then split the rest of the dataset into train and test sets.
2. Keep all the records with missing data in the test set, train the model, predict their values, and then compare the predictions with the imputed mean.
Basically, I have been given a dataset with a target variable. My concern is that no metadata is given, so it's hard to understand which independent variables might influence the target variable. Year, month, day and hour are also given as columns.
– Aditya Kadrekar, Nov 11 '18 at 4:39
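A sketch of how such a comparison could be set up, using only NumPy. The rows with a missing target have no ground truth, so the model and the mean baseline can only be scored on held-out rows whose target is known. Ordinary least squares is used here purely as a stand-in model, and the data is synthetic.

```python
import numpy as np

# Hypothetical data: two features, a linear target, 25 targets missing.
rng = np.random.default_rng(3)
n = 500
X = rng.normal(size=(n, 2))
y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.2, size=n)
y[rng.choice(n, size=25, replace=False)] = np.nan

known = ~np.isnan(y)
X_known, y_known = X[known], y[known]

# Hold out 20% of the labelled rows for scoring.
perm = rng.permutation(len(y_known))
n_test = len(y_known) // 5
test_idx, train_idx = perm[:n_test], perm[n_test:]

# Ordinary least squares with an intercept column.
A_train = np.column_stack([np.ones(len(train_idx)), X_known[train_idx]])
coef, *_ = np.linalg.lstsq(A_train, y_known[train_idx], rcond=None)

A_test = np.column_stack([np.ones(n_test), X_known[test_idx]])
model_pred = A_test @ coef
mean_pred = np.full(n_test, y_known[train_idx].mean())


def rmse(pred, truth):
    """Root mean squared error between predictions and true values."""
    return float(np.sqrt(np.mean((pred - truth) ** 2)))


model_rmse = rmse(model_pred, y_known[test_idx])
baseline_rmse = rmse(mean_pred, y_known[test_idx])

# Predictions for the rows whose target was missing.
A_unknown = np.column_stack([np.ones((~known).sum()), X[~known]])
filled = A_unknown @ coef
```

If `model_rmse` is not clearly below `baseline_rmse` on the held-out rows, filling the missing targets with model predictions is unlikely to be better than using the mean.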
