Column With Many Missing Values (36%)

Hello, this is my first machine learning project. I have a dataset with 18,000 rows, and one column has 4,244 missing values.

I don't know why the values are missing, because where a zero is appropriate the column already contains a 0. The dtype of the column is int64. I consider this column usable and would like to include it in the model.

Could you please help me deal with this problem, or point me to a resource that explains how to handle it?







Tags: machine-learning, dataset, missing-data

asked Apr 4 at 15:47 by dungeon




















2 Answers


















Quoting the question: "I don't know why the values are missing since when it's appropriate there's a 0 value in it."

The first step is to check with a subject matter expert (SME) or the data custodian. I can't tell you how many times I've built a model or started an analysis only to find out that the data was wrong. Try to figure out the reason behind the nulls and the 0s.
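Before that conversation, it can help to quantify the missingness and check whether it lines up with anything else in the data. The following is a minimal sketch under assumptions: the DataFrame, the column names `visits` and `region`, and the toy values are all hypothetical and not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "visits" stands in for the column with missing values.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south", "east"],
    "visits": [3, np.nan, 0, np.nan, np.nan, 7],
})

# Count and share of missing values in the column.
n_missing = int(df["visits"].isna().sum())
share_missing = n_missing / len(df)

# Explicit zeros are real observations, not missing values, so count them separately.
n_zero = int((df["visits"] == 0).sum())

# Share of missing values per group: a strong imbalance hints that the
# missingness depends on another variable rather than being random.
missing_by_region = df["visits"].isna().groupby(df["region"]).mean()

print(n_missing, round(share_missing, 2), n_zero)
print(missing_by_region)
```

One detail worth checking: in pandas, a column that holds NaN is normally stored as float64 (or the nullable Int64 type), so a plain int64 dtype may mean the missing values are encoded some other way, for example as a sentinel number.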



Besides that, there are many ways to handle missing data. A few are below:

• Remove records that are missing a value in this column. If this column is important to your model, it may be best to drop those records, depending on the shape (rows x columns/features) of your dataset. Don't let questionable data skew the results of your model, which can still happen even if you use some of the methods below.

• Mean/median/mode imputation - A common method of handling missing data is to fill the missing values with the column's mean or median (the mode is rarely used).

• Fill with values that produce a normal distribution - It depends on your data, but filling the values so that the column ends up normally distributed can be beneficial.

• Try all these methods and more - When you start modeling you'll learn to "throw stuff at the wall" and see what sticks. Look at your model results, talk with SMEs, and think about what makes sense. Some ways of handling missing data work better than others for a given model or dataset. Experiment and have fun!
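To make the first two options concrete, here is a minimal sketch that applies dropping, mean imputation, and median imputation to the same column and compares summary statistics. The Series `s` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column with missing entries.
s = pd.Series([1.0, 2.0, 2.0, 3.0, 40.0, np.nan, np.nan, np.nan])

candidates = {
    "drop": s.dropna(),
    "mean": s.fillna(s.mean()),
    "median": s.fillna(s.median()),
}

# Compare how each strategy changes the size, centre, and spread of the column.
summary = pd.DataFrame({
    name: {"n": len(col), "mean": col.mean(), "median": col.median(), "std": col.std()}
    for name, col in candidates.items()
}).T

print(summary)
```

Filling with a single constant keeps every row but shrinks the spread of the column, while dropping keeps the observed distribution at the cost of rows. Which trade-off is acceptable depends on the model and on why the values are missing.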






answered Apr 4 at 20:03 by MattR





















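What you want to do is called imputation of missing values. There are several strategies. Commonly you use the column mean, the median, or a value that serves as a good default.

If you are using a pandas DataFrame (with numpy imported as np), then you can do:

Replace with 0

    df = df.fillna(0)

Replace with column mean

    # df.mean() returns one mean per numeric column; fillna matches them by column name
    df = df.fillna(df.mean(numeric_only=True))

Replace with column median

    df = df.fillna(df.median(numeric_only=True))

If you are using NumPy, you could do the following, where X is a 2-D float array:

Replace with 0

    # note: nan_to_num also replaces infinities with large finite numbers
    X = np.nan_to_num(X)

Replace with mean

    col_mean = np.nanmean(X, axis=0)      # per-column mean, ignoring NaN
    inds = np.where(np.isnan(X))          # (row indices, column indices) of the NaNs
    X[inds] = np.take(col_mean, inds[1])  # fill each NaN with its column's mean

Replace with median

    col_median = np.nanmedian(X, axis=0)
    inds = np.where(np.isnan(X))
    X[inds] = np.take(col_median, inds[1])

If you want some reading: imputation strategies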
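The DataFrame calls above fill every column at once. When only one column is affected and a large share of it is missing, one common variation is to fill just that column and keep a flag recording which rows were imputed, so the model can still see that information. A minimal sketch, assuming a hypothetical column named `visits` and made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical data; only "visits" has missing values.
df = pd.DataFrame({
    "age": [25, 31, 47, 52, 38],
    "visits": [3, np.nan, 0, np.nan, 7],
})

# Record which rows were missing before filling them in.
df["visits_missing"] = df["visits"].isna().astype(int)

# Fill only this column with its median, computed from the observed values.
median_visits = df["visits"].median()
df["visits"] = df["visits"].fillna(median_visits)

# The column was float because of NaN; cast back to integers once it is complete.
df["visits"] = df["visits"].round().astype("int64")

print(df)
```

In a real project the fill value would usually be computed on the training split only and then reused for validation data and new data, so that no information leaks from the held-out rows.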







answered Apr 4 at 16:04 by Simon Larsson, edited Apr 4 at 18:13











• So it's OK to just fill in the column's NaN values even if there are going to be so many of them?
  – dungeon, Apr 4 at 17:21

• Not necessarily a good idea. It is usually a good starting point to figure out the meaning of the variables and then decide about things like this.
  – Michael M, Apr 4 at 17:54

• I agree with @MichaelM. The zero was based on "it's appropriate there's a 0 value in it", which made me think you had already decided on that as a default. I updated my answer with some more options.
  – Simon Larsson, Apr 4 at 17:57
































