How to deal with name strings in large data sets for ML?
My data set contains multiple columns with first name, last name, etc. I want to use a classifier model such as Isolation Forest later.
Word embedding techniques are mostly applied to longer text sequences rather than single-word strings like these, so I doubt they would work well here. Label encoding and label binarization also seem unsuitable for names: binarization produces too many columns because there are so many distinct values, and label encoding assigns integers that allow no meaningful comparison between names.
Are there other approaches to encode or transform name information so that it can be used with ML algorithms?
python nlp preprocessing encoding classifier
asked Mar 6 at 10:06 by Danny Abstemio
edited Mar 6 at 10:20 by HFulcher
1 Answer
Your problem is essentially high cardinality in your features, right? The right approach depends on your problem, but you can look into mean encoding: you replace each name with the mean of the target variable for that name. However, this is highly prone to overfitting, so apply it with care.
The following two videos give an excellent explanation:
- https://www.coursera.org/learn/competitive-data-science/lecture/b5Gxv/concept-of-mean-encoding
- https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization
However, I would also consider removing sensitive information such as names, depending on your application. Always ask whether a feature makes sense.
I hope this helps. If you have any questions, leave a comment.
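To make the idea concrete, here is a minimal sketch of mean encoding with two common guards against overfitting: smoothing toward the global mean and out-of-fold estimation. It is an illustration under stated assumptions, not the method from the videos: it assumes a labeled numeric target is available (Isolation Forest itself is unsupervised, so this only applies if you have labels), and the names `smoothed_means`, `mean_encode_oof`, `alpha` and `n_splits` are hypothetical choices made for this example.

```python
import random
from collections import defaultdict


def smoothed_means(names, targets, alpha):
    """Return ({name: smoothed target mean}, prior) for the given rows.

    Each name's mean is pulled toward the overall mean (the prior):
    (sum + alpha * prior) / (count + alpha). Rare names stay near the prior.
    """
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for name, y in zip(names, targets):
        sums[name] += y
        counts[name] += 1
    means = {n: (sums[n] + alpha * prior) / (counts[n] + alpha) for n in counts}
    return means, prior


def mean_encode_oof(names, targets, n_splits=5, alpha=10.0, seed=0):
    """Out-of-fold mean encoding: each row is encoded from the other folds only.

    A row's own target never contributes to its encoded value, which limits
    target leakage. Hypothetical helper; requires n_splits >= 2.
    """
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")
    indices = list(range(len(names)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    encoded = [0.0] * len(names)
    for held_out in folds:
        held = set(held_out)
        fit = [i for i in indices if i not in held]
        means, prior = smoothed_means(
            [names[i] for i in fit], [targets[i] for i in fit], alpha
        )
        for i in held_out:
            # Names unseen in the fitting folds fall back to the prior.
            encoded[i] = means.get(names[i], prior)
    return encoded


# Toy usage with made-up data: one numeric feature per row replaces the name.
names = ["Anna", "Anna", "Ben", "Ben", "Ben", "Zoe"]
labels = [1, 0, 1, 1, 1, 0]
encoded = mean_encode_oof(names, labels, n_splits=3, alpha=2.0)
```

A larger `alpha` shrinks rare names more strongly toward the global mean; with `alpha=0` the function reduces to the plain per-name mean, which is the variant most prone to overfitting on high-cardinality columns.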
answered Mar 6 at 12:21 by Victor Oliveira
Thanks for your answer. It helps to go on. – Danny Abstemio, Mar 6 at 12:52