Column With Many Missing Values (36%)
Hello, this is my first machine learning project. I have a dataset with 18,000 rows, and one column has 4244 missing values.
I don't know why the values are missing, since where it's appropriate the column already contains a 0.
The dtype of the column is int64. I consider this column usable and would like to include it in the model.
Could you please help me deal with this problem, or point me to a resource that teaches how to deal with it?
machine-learning dataset missing-data
asked Apr 4 at 15:47
dungeon
2 Answers
"I don't know why the values are missing since when it's appropriate there's a 0 value in it."
The first step is to check with an SME (Subject Matter Expert) or the data custodian. I can't tell you how many times I've built a model or started an analysis only to find out that the data was wrong. Try to figure out the reason behind the nulls and the 0s.
Besides that, there are many ways to handle missing data. A few are below:
- Remove the records where this column is missing. If the column is important to your model, dropping those records may be the best option, depending on the shape (rows x columns/features) of your dataset. Don't let questionable data throw off your model's results, which can happen even if you use some of the methods below.
- Mean/median/mode imputation: a common way to handle missing data is to fill the missing values with the column's mean or median (the mode is rarely used).
- Fill in values so that the column is normally distributed: it depends on your data, but filling the values so that the column ends up normally distributed can be beneficial.
- Try all these methods and more: when you start modeling you'll learn to "throw stuff at the wall" and see what sticks. Look at your model results, talk with SMEs, and think about what makes sense. Some ways of handling missing data work better with some models and datasets than with others. Experiment and have fun!
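As a rough sketch of the first two options above, the snippet below compares dropping rows with median imputation on a tiny toy table. The column names `feature` and `target` and all the values are made up for illustration; they are not from the asker's dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset; names and values are hypothetical.
df = pd.DataFrame({
    "feature": [3.0, np.nan, 0.0, 7.0, np.nan, 5.0],
    "target": [1, 0, 0, 1, 1, 0],
})

# Fraction of rows where the column is missing.
missing_share = df["feature"].isna().mean()

# Option 1: drop every row where the column is missing.
dropped = df.dropna(subset=["feature"])

# Option 2: keep all rows and fill the gaps with the median of the observed values.
median_value = df["feature"].median()
median_filled = df.assign(feature=df["feature"].fillna(median_value))

print(f"missing share: {missing_share:.2f}")
print(f"rows kept after dropping: {len(dropped)} of {len(df)}")
print(f"median used for filling: {median_value}")
```

Dropping keeps only the fully observed rows, while filling keeps every row at the cost of inventing values, so it is worth checking how each choice changes the model results.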
answered Apr 4 at 20:03
MattR
What you want to do is called imputation of missing values. There are several strategies; commonly you fill with the column mean, the column median, or a value that serves as a sensible default.
If you are using a Pandas DataFrame, you can do the following.
Replace with 0:
    df = df.fillna(0)
Replace with the column mean:
    df = df.fillna(df.mean(numeric_only=True))
Replace with the column median:
    df = df.fillna(df.median(numeric_only=True))
If you are using NumPy (with X a 2-D float array), you could do the following.
    import numpy as np
Replace with 0:
    X = np.nan_to_num(X)  # NaN -> 0.0 (also maps +/-inf to large finite values)
Replace with the column mean:
    col_mean = np.nanmean(X, axis=0)      # per-column mean, ignoring NaN
    inds = np.where(np.isnan(X))          # (row, column) indices of the NaN entries
    X[inds] = np.take(col_mean, inds[1])  # fill each NaN with its own column's mean
Replace with the column median:
    col_median = np.nanmedian(X, axis=0)  # per-column median, ignoring NaN
    inds = np.where(np.isnan(X))
    X[inds] = np.take(col_median, inds[1])
If you want some reading: imputation strategies
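The DataFrame snippets above fill every column at once, whereas the question concerns a single column that is expected to hold integers. A minimal sketch of filling only that column and casting it back to int64 might look like this; the column names `col` and `other` and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "col" is the column to impute, "other" is left untouched.
df = pd.DataFrame({
    "col": [1.0, np.nan, 4.0, np.nan, 10.0],
    "other": [np.nan, 2.0, 3.0, 4.0, 5.0],
})

# Median of the observed values in the one column of interest.
col_median = df["col"].median()

# Fill only that column. Round before casting, because astype("int64")
# would otherwise truncate a fractional fill value.
df["col"] = df["col"].fillna(col_median).round().astype("int64")

print(df)
```

A column that contains NaN is stored as float in pandas, so the cast back to int64 only works once every gap has been filled.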
edited Apr 4 at 18:13
answered Apr 4 at 16:04
Simon Larsson
So it's OK to just fill in the NaN values in the column, even if there are going to be so many of them? – dungeon, Apr 4 at 17:21
Not necessarily a good idea. Usually a good starting point is to figure out the meaning of the variables and only then decide about things like this. – Michael M, Apr 4 at 17:54
I agree with @MichaelM. The zero was based on "it's appropriate there's a 0 value in it", which made me think you had already decided on that as a default. I updated my answer with some more options. – Simon Larsson, Apr 4 at 17:57