Column With Many Missing Values (36%)


Hello, this is my first machine learning project. I have a dataset with 18,000 rows, and one column has 4,244 missing values.

I don't know why the values are missing, since the column contains an explicit 0 where a zero is appropriate.
The dtype of the column is int64. I consider this column usable and would like to include it in the model.

Could you please help me deal with this problem, or point me to a resource that teaches how to handle it?

asked Apr 4 at 15:47 by dungeon


machine-learning dataset missing-data


2 Answers
Quoting the question: "I don't know why the values are missing since when it's appropriate there's a 0 value in it."

The first step is to check with a subject matter expert (SME) or the data custodian. I can't tell you how many times I've built a model or started an analysis only to find out that the data was wrong. Try to figure out the reason behind the nulls and the zeros.
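Before that conversation, it can help to quantify the gap. The following is only a minimal sketch, assuming the data sits in a pandas DataFrame; the column names `feature` and `group` and the toy values are hypothetical stand-ins, not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "feature" stands in for the column with gaps,
# "group" for any other column the missingness might depend on.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b", "c"],
    "feature": [0.0, 3.0, np.nan, np.nan, 5.0, 1.0],
})

n_missing = int(df["feature"].isna().sum())    # number of missing entries
share_missing = df["feature"].isna().mean()    # fraction of rows that are missing
n_zero = int((df["feature"] == 0).sum())       # explicit zeros, counted separately

# Share of missing values per group. A strongly uneven pattern hints that
# the values are not missing at random and deserves a question to the SME.
missing_by_group = df["feature"].isna().groupby(df["group"]).mean()
print(n_missing, share_missing, n_zero)
print(missing_by_group)
```

Counting zeros separately from missing entries matters here because the question states that the column already uses 0 where zero is the right value.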

Besides that, there are many ways to handle missing data. A few are listed below:

Remove records with a missing value in this column. If the column is important to your model, it may be best to drop those records, depending on the shape (rows x cols/features) of your dataset. Don't let questionable data throw off the results of your model (a risk that remains even if you use some of the methods below).

Mean/Median/Mode Impute - A common method of handling missing data is to fill the missing values with the column's mean or median (the mode is rarely used).

Fill with values that produce a normal distribution - it depends on your data, but filling the gaps so that the column ends up normally distributed can be beneficial.

Try all these methods and more - When you start modeling you'll learn to "throw stuff at the wall" and see what sticks. Look at your model results, talk with SMEs, and think about what makes sense. Some ways of handling missing data will work better with different models/datasets. Experiment and have fun!
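The options above can be tried side by side. This is a rough sketch, assuming pandas and NumPy; the column names `feature` and `target`, the added `feature_missing` flag, and the toy values are illustrative assumptions. Option 3 draws replacements at random from the observed values, which is one simple way to preserve the column's distribution and not necessarily the exact method the answer has in mind.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two gaps in "feature".
df = pd.DataFrame({"feature": [1.0, np.nan, 3.0, np.nan, 8.0],
                   "target": [0, 1, 0, 1, 1]})

# Option 1: drop rows where the column is missing.
dropped = df.dropna(subset=["feature"])

# Option 2: fill with the median and keep a flag marking imputed rows,
# so a model can still see which values were originally missing.
imputed = df.copy()
imputed["feature_missing"] = imputed["feature"].isna().astype(int)
imputed["feature"] = imputed["feature"].fillna(df["feature"].median())

# Option 3: draw replacements at random from the observed values, which
# keeps the column's spread closer to the original than a constant fill.
rng = np.random.default_rng(0)
observed = df["feature"].dropna().to_numpy()
sampled = df.copy()
mask = sampled["feature"].isna()
sampled.loc[mask, "feature"] = rng.choice(observed, size=int(mask.sum()))
```

Each resulting frame can then be fed to the same model and compared on a held-out set, which is the "see what sticks" step in practice.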

answered Apr 4 at 20:03 by MattR

What you want to do is called imputation of missing values. There are several strategies; commonly you use the column mean, the median, or a value that serves as a good default.

If you are using a Pandas DataFrame then you can do:

Replace with 0

df = df.fillna(0)

Replace with column mean

df = df.fillna(df.mean(numeric_only=True))

Replace with column median

df = df.fillna(df.median(numeric_only=True))
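A small worked example of these fills on a toy frame may make the behavior concrete. The columns `a` and `b` and their values are hypothetical. `DataFrame.fillna` accepts a Series of per-column values, so passing the frame's column means (or medians) fills each column with its own statistic.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per column.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 8.0],
                   "b": [np.nan, 10.0, 30.0, 80.0]})

filled_zero = df.fillna(0)
# Per-column means here are a=4.0, b=40.0; per-column medians are a=3.0, b=30.0.
filled_mean = df.fillna(df.mean(numeric_only=True))
filled_median = df.fillna(df.median(numeric_only=True))

# To impute a single column only, fill that column rather than the whole frame.
only_a = df.copy()
only_a["a"] = only_a["a"].fillna(only_a["a"].median())
```

Filling a single column, as in `only_a`, is usually closer to the situation in the question, where only one column has gaps.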

If you are using numpy you could do:

Replace with 0

X = np.nan_to_num(X)

Replace with mean

col_mean = np.nanmean(X, axis=0)
inds = np.where(np.isnan(X))
X[inds] = np.take(col_mean, inds[1])

Replace with median

col_median = np.nanmedian(X, axis=0)
inds = np.where(np.isnan(X))
X[inds] = np.take(col_median, inds[1])
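The NumPy recipe can also be wrapped in a small helper. This is a sketch, assuming `X` is a 2-D array of floats; the function name `impute_columns` and the toy array are made up for illustration.

```python
import numpy as np

def impute_columns(X, reducer=np.nanmean):
    """Return a copy of X with NaNs replaced by a per-column statistic."""
    X = np.array(X, dtype=float)          # copy, so the caller's array is untouched
    col_stat = reducer(X, axis=0)         # one statistic per column, ignoring NaNs
    rows, cols = np.where(np.isnan(X))    # positions of the missing entries
    X[rows, cols] = col_stat[cols]        # look up each entry's column statistic
    return X

# Hypothetical toy array with one gap in each column.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
X_mean = impute_columns(X)                  # column means are 2.0 and 6.0
X_median = impute_columns(X, np.nanmedian)  # column medians, same call shape
```

Note that a column consisting only of NaN has no mean or median, so `np.nanmean` and `np.nanmedian` return NaN for it (with a warning) and the gaps in that column stay unfilled.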

If you want some reading: imputation strategies

answered Apr 4 at 16:04 by Simon Larsson (edited Apr 4 at 18:13)

So it's OK to just fill in the column's NaN values, even if there are going to be so many of them?
– dungeon, Apr 4 at 17:21

Not necessarily a good idea. Usually a good starting point is to figure out the meaning of the variables and then decide about things like this.
– Michael M, Apr 4 at 17:54

I agree with @MichaelM. The zero was based on "it's appropriate there's a 0 value in it", which made me think you had already decided on that as a default. I updated my answer with some more options.
– Simon Larsson, Apr 4 at 17:57


