Fuzzy name and nickname match
I have a dataset with the following structure:
full_name,nickname,match
Christian Douglas,Chris,1
Jhon Stevens,Charlie,0
David Jr Simpson,Junior,1
Anastasia Williams,Stacie,1
Lara Williams,Ana,0
John Williams,Willy,1
where each predictor row is a (full name, nickname) pair, and the target variable, match, is 1 when the nickname corresponds to the person with that name and 0 otherwise. As you can see, the way the nickname is obtained from the full name doesn't follow any particular pattern.
I want to train an ML algorithm that, given a (full name, nickname) pair, predicts the probability of a match.
My baseline just looks at the number of matching characters, and features like that. However, I am considering an NLP approach using deep learning. My question is whether there are neural network architectures specific to this problem.
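For reference, the kind of character-overlap baseline described above could be sketched with Python's standard difflib (an illustrative stand-in, not the actual feature set used):

    from difflib import SequenceMatcher

    def baseline_match_score(full_name, nickname):
        # Crude baseline: ratio of matching character blocks between the two strings.
        return SequenceMatcher(None, full_name.lower(), nickname.lower()).ratio()

    print(baseline_match_score("Christian Douglas", "Chris"))  # relatively high
    print(baseline_match_score("Jhon Stevens", "Charlie"))     # lower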
Tags: deep-learning, nlp
asked Mar 19 at 13:36 by David Masip; edited Mar 19 at 14:17
This question has an open bounty worth +50 reputation from David Masip, ending at 2019-03-29 13:56:55Z (in 5 days). The bounty seeks an answer drawing from credible and/or official sources that gives an overview of the state of the art on this question.
How many samples are there for training? – Shamit Verma, Mar 19 at 13:38
Around 100k, but only 17% have match = 0. – David Masip, Mar 19 at 13:46
Is this an open dataset that you can share a link to? – Adarsh Chavakula, yesterday
How do you find a relationship between a name and a nickname when there is none, for instance "Anastasia Williams | Stacie"? I think you need more features to make this work. – iamklaus, 5 hours ago
2 Answers
I couldn't find any useful literature on using deep learning for this specific problem. Most methods seem to rely on non-machine-learning techniques like string similarities and Levenshtein distances. A reasonable deep learning approach to this problem would be a recurrent neural network (RNN). An LSTM (long short-term memory) or GRU (gated recurrent unit) would be ideal. The idea is to have an RNN that keeps an internal state and respects the order in which the inputs are fed.
Unlike text classification, sentiment analysis, or sequence generation, the preferred encoding for the text here is at the character level rather than the word level.
For example,
Christian Douglas,Chris,1
Jhon Stevens,Charlie,0
would become
[C,h,r,i,s,t,i,a,n, ,D,o,u,g,l,a,s, ,C,h,r,i,s] --> [1]
[J,h,o,n, ,S,t,e,v,e,n,s, ,C,h,a,r,l,i,e] --> [0]
The two strings to be matched are concatenated into a single sequence. The intuition is that the RNN processes the sequence character by character and learns (read: updates its weights) that the characters at the end follow a pattern similar to what it saw earlier in the same sequence, and so deduces that the label should be 1 instead of 0. The vector of [1/0] values is the target variable.
The standard RNN pre-processing steps apply as usual: we pad the sequences at the beginning so that they're all the same length (say 50), encode the characters as integers instead of strings, etc.
Since the dictionary here is pretty small (26 letters + space + pad), the network architecture can be fairly simple. A single embedding layer plus a recurrent layer should suffice.
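A minimal sketch of such a model in Keras (the sequence length, layer sizes, and training settings here are illustrative assumptions, not tuned values):

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # Character vocabulary: a-z plus space; index 0 is reserved for padding.
    chars = "abcdefghijklmnopqrstuvwxyz "
    char_to_idx = {c: i + 1 for i, c in enumerate(chars)}
    MAX_LEN = 50  # assumed maximum length of the concatenated pair

    def encode(full_name, nickname):
        # Concatenate the pair into one sequence and map characters to integer ids.
        seq = f"{full_name} {nickname}".lower()
        return [char_to_idx[c] for c in seq if c in char_to_idx]

    pairs = [("Christian Douglas", "Chris", 1), ("Jhon Stevens", "Charlie", 0)]
    # pad_sequences pads at the beginning by default, as described above.
    X = pad_sequences([encode(f, n) for f, n, _ in pairs], maxlen=MAX_LEN)
    y = np.array([m for _, _, m in pairs])

    model = Sequential([
        Embedding(input_dim=len(chars) + 1, output_dim=16, mask_zero=True),
        LSTM(32),
        Dense(1, activation="sigmoid"),  # probability of match
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=3, batch_size=32)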
Framing the problem in this manner lets us use a vanilla RNN or an out-of-the-box LSTM/GRU instead of creating a custom architecture that takes two separate strings as input for each data point and outputs a number.
You could give this approach a shot and see if it is able to satisfactorily beat the baseline models.
A good read on character-level RNNs is Andrej Karpathy's blog post and code. The problem he's trying to solve is different and the code is in pure numpy, but it still captures the idea pretty well.
answered yesterday by Adarsh Chavakula
I had a similar problem in my last job. My solution was to build features via (transformation(s) + comparison) in many combinations, feed them to models, then aggregate and model again, i.e. a two-layer model. The key is using encodings and similarity scores as features.
Transforms: remove vowels (great for certain roots), remove end vowels, remove double characters, convert to a phonetic string (IPA, Soundex, https://pypi.org/project/Fuzzy/), replace characters that either sound similar or have different sounds in other languages (J in Eastern Europe sounds like Y in the US; C can sound like K; D ~ T; T ~ TH; etc.). The strategy is to handle the many kinds of weirdness and irregularity in people's names.
Comparisons (similarity and difference): try similarity and difference scores at the character level, the block/root/prefix/suffix level, and the word level (which may not apply to you). Try Dice's coefficient, Levenshtein, Needleman–Wunsch, longest common (non)contiguous substring, character-histogram similarity, number of characters matching and not matching (each left and right), etc. You could also try an RNN/LSTM and have it learn a similarity for each transform, then use the output of the trained model(s) as another feature.
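To illustrate the transform + comparison idea, here is a small feature-building sketch using the jellyfish library linked below; the specific feature names, the choice of comparing against the first name only, and the vowel-removal rule are assumptions for the example, not the exact pipeline described:

    import os
    import jellyfish  # https://github.com/jamesturk/jellyfish

    def remove_vowels(s):
        # Transform: drop vowels after the first character.
        return s[:1] + "".join(c for c in s[1:] if c not in "aeiou")

    def pair_features(full_name, nickname):
        # Build (transform + comparison) features for one name/nickname pair.
        first = full_name.split()[0].lower()
        nick = nickname.lower()
        return {
            "lev_raw": jellyfish.levenshtein_distance(first, nick),
            "lev_no_vowels": jellyfish.levenshtein_distance(remove_vowels(first),
                                                            remove_vowels(nick)),
            "same_soundex": int(jellyfish.soundex(first) == jellyfish.soundex(nick)),
            "same_metaphone": int(jellyfish.metaphone(first) == jellyfish.metaphone(nick)),
            "common_prefix_len": len(os.path.commonprefix([first, nick])),
        }

    print(pair_features("Christian Douglas", "Chris"))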
Experiment with different combinations of the above and select a few that seem to have value. You could simply take all the scores and fit a logistic regression (or a neural net), or you could build statistical models and output a percent rank, based on a small training set, to normalize them. Another way to preprocess the raw scores is calibration encoding via a logistic function. Then add summary statistics of the normalized scores as additional features, and push all of this into the final model.
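A minimal sketch of the "feed all the scores to logistic regression" step, building on the hypothetical pair_features helper from the sketch above:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    pairs = [("Christian Douglas", "Chris", 1), ("Jhon Stevens", "Charlie", 0),
             ("John Williams", "Willy", 1), ("Lara Williams", "Ana", 0)]
    X = DictVectorizer(sparse=False).fit_transform(
        [pair_features(full, nick) for full, nick, _ in pairs])
    y = [match for _, _, match in pairs]

    # class_weight="balanced" is one way to compensate for the ~17%/83%
    # label skew mentioned in the comments.
    clf = LogisticRegression(class_weight="balanced").fit(X, y)
    print(clf.predict_proba(X)[:, 1])  # probability of match per pair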
Will you handle names derived from Arabic, Spanish, French, etc.? This is just extra, but consider downloading the Social Security and US Census name statistics to enhance your project with more name variations. I'll leave the how to you, but it helps to know the likely possibilities. Be aware that Levenshtein alone doesn't work well for William -> Bill, Dianne -> Di, Larry -> Lawrence, Mohammed -> Muhamed and Hamed, Danielle -> Daniela, Thomas -> Tom, or Jimmy -> James. The strategy I described should help with all that variation.
Additional resources to explore:
https://github.com/jamesturk/jellyfish
https://nameberry.com/list/276/If-You-Like-Danielle-You-Might-Love
https://pypi.org/project/phonetics/
answered 12 hours ago by ldmtwo (new contributor); edited 10 hours ago by Siong Thye Goh