How to deal with name strings in large data sets for ML?


My data set contains multiple columns with first name, last name, etc. I want to use a classifier model such as Isolation Forest later.

Word embedding techniques are mostly designed for longer text sequences, not for single-word strings as in this case, so I don't think they would work well here. Label encoding and label binarization also seem unsuitable for names: label binarization produces too many columns because there are so many distinct values, and label encoding assigns arbitrary integers that allow no meaningful comparison between names.
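To make the two problems concrete, here is a minimal sketch in plain Python. The names are made up for illustration, and a real data set would have far more distinct values than this toy column.

```python
# Hypothetical sample of a first-name column (illustrative values only).
names = ["Anna", "Ben", "Chloe", "Anna", "Dmitri", "Ben"]

# Label encoding: one arbitrary integer per distinct name.
# The integers imply an order (Anna < Ben < ...) that carries no meaning.
categories = sorted(set(names))
label_of = {name: i for i, name in enumerate(categories)}
label_encoded = [label_of[name] for name in names]

# Label binarization (one-hot): one column per distinct name,
# so the width grows with the number of distinct names.
one_hot = [[1 if name == c else 0 for c in categories] for name in names]

print(label_encoded)    # integer codes, one per row
print(len(one_hot[0]))  # number of columns = number of distinct names
```

With tens of thousands of distinct names, the one-hot matrix would have tens of thousands of mostly-zero columns, while the integer codes would stay compact but meaningless as numbers.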

Are there other approaches for transforming name information so that it can be used with ML algorithms?

edited Mar 6 at 10:20

HFulcher

11213

asked Mar 6 at 10:06

Danny Abstemio

python nlp preprocessing encoding classifier

1 Answer

Your problem is essentially that your features have high cardinality, right? The best choice depends on your problem, but you could look into mean encoding. Essentially, you replace each name with the mean of the target variable over the rows that share that name. This is highly prone to overfitting, however, so apply it with care.

The following two videos will give an excellent explanation:

https://www.coursera.org/learn/competitive-data-science/lecture/b5Gxv/concept-of-mean-encoding

https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization
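As a rough illustration of mean encoding, here is a minimal sketch in plain Python. It assumes a numeric target column is available, and it uses additive smoothing toward the global mean as one possible form of regularization. The function names, the `smoothing` parameter, and the toy data are all hypothetical, not taken from the videos.

```python
from collections import defaultdict


def fit_mean_encoding(names, targets, smoothing=10.0):
    """Learn a smoothed target mean per name from training data only."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for name, y in zip(names, targets):
        sums[name] += y
        counts[name] += 1
    # Shrink each per-name mean toward the global mean. Rare names
    # (small count) stay close to the global mean, which limits overfitting.
    encoding = {
        name: (sums[name] + smoothing * global_mean) / (counts[name] + smoothing)
        for name in counts
    }
    return encoding, global_mean


def apply_mean_encoding(names, encoding, global_mean):
    """Replace each name by its learned value; unseen names get the global mean."""
    return [encoding.get(name, global_mean) for name in names]


# Hypothetical toy data: a last-name column and a binary target.
train_names = ["Smith", "Smith", "Smith", "Nguyen", "Garcia", "Garcia"]
train_target = [1, 1, 0, 1, 0, 0]

encoding, global_mean = fit_mean_encoding(train_names, train_target, smoothing=2.0)
encoded = apply_mean_encoding(["Smith", "Nguyen", "Unseen"], encoding, global_mean)
print(encoded)
```

Fitting the encoding on the training rows only (or on the other folds in cross-validation) and then applying it to held-out rows helps keep target information from leaking into the feature.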

However, I would also consider removing sensitive information such as names, depending on your application. Always ask whether a feature makes sense.

I hope this helps. If you have any questions, leave a comment.

answered Mar 6 at 12:21

Victor Oliveira

3657

Thanks for your answer. It helps to go on.
– Danny Abstemio Mar 6 at 12:52
