Aggregating target-encoded array-like categorical features?

I am trying to find commonly used techniques for dealing with high-cardinality, multi-valued categorical variables in machine learning classification algorithms.

One-hot encoding leads to very high dimensionality. The approach I've landed on is target encoding (mean encoding). I understand how to use this when the categorical feature is a single choice (e.g. current zip code). But when the feature can take on multiple values from a large list (e.g. favorite hobbies, illness symptoms, university coursework), I am not sure how to combine the values.

My intuition says that the wrong approach would be to treat each unique combination as its own level and encode that, as it would lead to overfitting. Other things that come to mind are simple aggregations like sum/avg/product/variance.

How should target-encoded values be combined?
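To make the question concrete, here is a minimal sketch of the simple-aggregation idea, not an established answer. It assumes a binary target, encodes each individual value with a smoothed target mean, and then collapses the per-value encodings of each sample into fixed-length features. The function names, the `prior_weight` smoothing parameter, and the toy hobby data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean, pvariance


def fit_value_encoding(rows, targets, prior_weight=5.0):
    """Smoothed target mean for each individual value of a multi-valued feature.

    rows: list of lists of category values (one list per sample)
    targets: list of 0/1 labels
    prior_weight: assumed smoothing strength pulling rare values
                  toward the global target mean
    """
    global_mean = mean(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for values, y in zip(rows, targets):
        for v in set(values):  # count each value at most once per sample
            sums[v] += y
            counts[v] += 1
    encoding = {
        v: (sums[v] + prior_weight * global_mean) / (counts[v] + prior_weight)
        for v in counts
    }
    return encoding, global_mean


def aggregate_row(values, encoding, global_mean):
    """Collapse the per-value encodings of one sample into fixed-length features."""
    # Unseen values and empty lists fall back to the global mean.
    encoded = [encoding.get(v, global_mean) for v in values] or [global_mean]
    return {
        "te_mean": mean(encoded),
        "te_min": min(encoded),
        "te_max": max(encoded),
        "te_var": pvariance(encoded),
        "n_values": len(values),
    }


# Toy data (made up): each sample lists zero or more hobbies.
hobbies = [["chess", "golf"], ["golf"], ["chess", "knitting"], ["knitting"], []]
target = [1, 0, 1, 0, 0]

encoding, global_mean = fit_value_encoding(hobbies, target, prior_weight=1.0)
features = [aggregate_row(h, encoding, global_mean) for h in hobbies]
```

In this sketch the encoding is fitted and applied on the same rows purely for brevity. In practice the per-value means would normally be fitted on training folds only and applied to held-out rows, since target encoding otherwise leaks the label into the features.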
















      machine-learning feature-engineering encoding






      asked Apr 9 at 18:41









      user4446237
