


Target Encoding: missing value imputation before or after encoding




I want to target encode my categorical features, but I am not sure when to impute if any of them has missing values.
Let's say I have a few continuous features, Cnt1-Cnt5 (without NAs), and two categorical features, Cat1 and Cat2, where Cat2 has missing values. Let's also assume that I want to use a Random Forest (RF) as the imputation method. Which approach would be the correct one?



  1. Impute Cat2 with an RF that uses Cat1 and Cnt1-Cnt5 as predictors, then target encode the categorical variables.


  2. Target encode Cat1 and the non-missing values of Cat2, then build an RF and impute the missing values of Cat2 (which is now numeric, not categorical).


  3. Any other approach?
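
To make option 1 concrete, here is a minimal sketch in Python with pandas and scikit-learn. The synthetic data, the binary target `y`, the one-hot encoding of Cat1 for the imputer, and the plain mean encoding are all assumptions made for illustration. It is not a recommendation of one option over the other.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200

# Synthetic data (an assumption): Cnt1-Cnt5, Cat1, Cat2 and a binary target y.
df = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"Cnt{i}" for i in range(1, 6)])
df["Cat1"] = rng.choice(["a", "b", "c"], size=n)
df["Cat2"] = rng.choice(["x", "y", "z"], size=n)
df["y"] = rng.integers(0, 2, size=n)
df.loc[rng.choice(n, size=30, replace=False), "Cat2"] = np.nan  # inject NAs into Cat2

# Step 1: impute Cat2 with an RF classifier on Cat1 and Cnt1-Cnt5.
# Cat1 is one-hot encoded only so the forest can consume it; y is not used here.
X_imp = pd.get_dummies(df.drop(columns=["Cat2", "y"]), columns=["Cat1"], dtype=float)
known = df["Cat2"].notna()
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_imp[known], df.loc[known, "Cat2"])
df.loc[~known, "Cat2"] = rf.predict(X_imp[~known])

# Step 2: plain mean target encoding of both categorical columns.
for col in ["Cat1", "Cat2"]:
    df[f"{col}_te"] = df[col].map(df.groupby(col)["y"].mean())
```

In a real workflow the encoding means would normally be learned on training folds only, so that the target does not leak into the encoded feature.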


More generally: should missing values of any kind of variable (including continuous ones) be imputed before or after target encoding?



I see at least one benefit of imputing after target encoding: if the test data contain levels of a categorical variable that were not seen in training (which become NAs in the test set after target encoding), they could be imputed by the RF built on the training data, without any error caused by new levels.
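
Option 2 and the unseen-level case can be sketched in the same way. The helper names `make_frame` and `encode`, the synthetic data, and the extra test level `"w"` are hypothetical and only serve the illustration. The encoding maps are learned on the training rows, so both missing values and unseen levels come out as NaN and are then filled by an RF regressor fitted on the training data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
cnt_cols = [f"Cnt{i}" for i in range(1, 6)]

def make_frame(n, cat2_levels):
    """Synthetic frame (an assumption) with Cnt1-Cnt5, Cat1, Cat2 and target y."""
    df = pd.DataFrame(rng.normal(size=(n, 5)), columns=cnt_cols)
    df["Cat1"] = rng.choice(["a", "b", "c"], size=n)
    df["Cat2"] = rng.choice(cat2_levels, size=n)
    df["y"] = rng.integers(0, 2, size=n)
    return df

train = make_frame(200, ["x", "y", "z"])
train.loc[:29, "Cat2"] = np.nan           # missing values in the training data
test = make_frame(50, ["x", "y", "z", "w"])
test.loc[0, "Cat2"] = "w"                 # guarantee one level unseen in training

# Encoding maps are learned on the training rows only (NaN keys are dropped).
maps = {c: train.groupby(c)["y"].mean() for c in ["Cat1", "Cat2"]}

def encode(df):
    out = df[cnt_cols].copy()
    for c, m in maps.items():
        out[f"{c}_te"] = df[c].map(m)     # missing or unseen level -> NaN
    return out

X_train, X_test = encode(train), encode(test)

# The RF regressor predicts the encoded Cat2 from the remaining columns.
predictors = [c for c in X_train.columns if c != "Cat2_te"]
known = X_train["Cat2_te"].notna()
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train.loc[known, predictors], X_train.loc[known, "Cat2_te"])

# Fill the gaps in both sets with the model fitted on the training data.
for X in (X_train, X_test):
    gap = X["Cat2_te"].isna()
    if gap.any():
        X.loc[gap, "Cat2_te"] = rf.predict(X.loc[gap, predictors])
```

Note that the imputed values are predictions of the encoded number, so they need not coincide with the encoded value of any actual Cat2 level.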


























      feature-engineering encoding data-imputation






asked yesterday by MarkSt (edited yesterday)
