How to deal with name strings in large data sets for ML?

My data set contains multiple columns with first name, last name, etc. I want to use a model such as Isolation Forest later.

Word embedding techniques are mostly designed for longer text sequences, not for single-word strings as in this case, so I don't think they would work well here. Label encoding and label binarization also seem unsuitable for names: label binarization produces too many distinct values (one column per name), and label encoding assigns arbitrary integers that allow no meaningful comparison between names.

Are there other approaches for transforming name information so that it can be used with ML algorithms?
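To make the trade-off between the two encodings concrete, here is a minimal sketch using a hypothetical list of first names. The sample values and variable names are illustrative only and are not taken from the data set in question.

```python
# Hypothetical sample of first names (an assumption for illustration).
names = ["Anna", "Ben", "Carla", "Anna", "Dmitri", "Ben"]

# Label encoding: each distinct name is mapped to an arbitrary integer.
vocabulary = sorted(set(names))
label_of = {name: index for index, name in enumerate(vocabulary)}
label_encoded = [label_of[name] for name in names]

# Label binarization (one-hot): one column per distinct name.
one_hot = [
    [1 if name == candidate else 0 for candidate in vocabulary]
    for name in names
]

print(label_encoded)    # integers whose order carries no meaning
print(len(one_hot[0]))  # column count equals the number of distinct names
```

With thousands of distinct names the one-hot width grows accordingly, while the integer labels suggest an ordering between names that has no real meaning.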
















      python nlp preprocessing encoding classifier














edited Mar 6 at 10:20 by HFulcher
asked Mar 6 at 10:06 by Danny Abstemio




















1 Answer
Your problem is essentially one of high cardinality in your features, right? Whether it suits you depends on your problem, but you could look into mean encoding: each name is replaced by the mean of the target variable for that name. This is highly prone to overfitting, however, so take care to regularize it.

The following two videos explain it very well:

• https://www.coursera.org/learn/competitive-data-science/lecture/b5Gxv/concept-of-mean-encoding

• https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization

I would also consider removing sensitive information such as names altogether, depending on your application. Always ask whether a feature makes sense for the task.

I hope this helps. If you have any questions, leave a comment.
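As a rough illustration of mean encoding with a simple form of regularization, the sketch below shrinks each name's target mean toward the global mean. It assumes a labelled binary target is available, which an unsupervised model such as Isolation Forest does not provide by itself. The function name `smoothed_mean_encoding`, the `smoothing` parameter, and the toy data are all hypothetical, and additive smoothing is only one of several possible regularization schemes.

```python
from collections import defaultdict


def smoothed_mean_encoding(categories, targets, smoothing=10.0):
    """Map each category to its target mean, shrunk toward the global mean.

    Rare categories are pulled strongly toward the global mean, which
    limits overfitting on names that appear only a few times.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    encoding = {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }
    return encoding, global_mean


# Hypothetical toy data: last names and a binary target.
last_names = ["Smith", "Smith", "Smith", "Nguyen", "Okafor", "Okafor"]
is_positive = [1, 0, 1, 1, 0, 0]

encoding, fallback = smoothed_mean_encoding(last_names, is_positive, smoothing=2.0)

# Replace each name by its encoded value; unseen names get the global mean.
encoded_column = [encoding.get(name, fallback) for name in last_names]
unseen = encoding.get("Garcia", fallback)
print(encoded_column, unseen)
```

In practice the encoding should be fitted on training data only (or out-of-fold within cross-validation) so that a row's own target value does not leak into its feature.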












answered Mar 6 at 12:21 by Victor Oliveira











• Thanks for your answer. It helps me move forward. – Danny Abstemio, Mar 6 at 12:52















