Fuzzy name and nickname match
I have a dataset with the following structure:
full_name,nickname,match
Christian Douglas,Chris,1
Jhon Stevens,Charlie,0
David Jr Simpson,Junior,1
Anastasia Williams,Stacie,1
Lara Williams,Ana,0
John Williams,Willy,1
where each predictor row is a (full name, nickname) pair, and the target variable, match, is 1 when the nickname corresponds to the person with that name and 0 otherwise. As you can see, the way the nickname is obtained from the full name doesn't follow any particular pattern.
I want to train an ML algorithm that, given a (full name, nickname) pair, predicts the probability of a match.
My baseline just looks at the number of matching characters, and features like that. However, I am considering an NLP approach using deep learning. My question is whether there are neural network architectures specific to this problem.
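For reference, the kind of character-overlap baseline described above could be sketched with Python's standard difflib (an illustrative stand-in, not the actual feature set used):

    from difflib import SequenceMatcher

    def baseline_match_score(full_name, nickname):
        # Crude baseline: ratio of matching character blocks between the two strings.
        return SequenceMatcher(None, full_name.lower(), nickname.lower()).ratio()

    print(baseline_match_score("Christian Douglas", "Chris"))  # relatively high
    print(baseline_match_score("Jhon Stevens", "Charlie"))     # lower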
Tags: deep-learning, nlp
asked Mar 19 at 13:36 by David Masip; edited Mar 19 at 14:17
This question has an open bounty worth +50 reputation from David Masip, ending at 2019-03-29 13:56:55Z (in 5 days). The bounty seeks an answer drawing from credible and/or official sources that gives an overview of the state of the art on this question.
How many samples are there for training? – Shamit Verma, Mar 19 at 13:38
Around 100k, but only 17% have match = 0. – David Masip, Mar 19 at 13:46
Is this an open dataset that you can share a link to? – Adarsh Chavakula, yesterday
How do you find a relationship between a name and a nickname when there is none, for instance "Anastasia Williams | Stacie"? I think you need more features to make this work. – iamklaus, 5 hours ago
2 Answers
I couldn't find any useful literature on using deep learning for this specific problem. Most methods seem to rely on non-machine-learning techniques like string similarities and Levenshtein distances. A reasonable deep learning approach to this problem would be a recurrent neural network (RNN). An LSTM (long short-term memory) or GRU (gated recurrent unit) would be ideal. The idea is to have an RNN that keeps an internal state and respects the order in which the inputs are fed.
Unlike text classification, sentiment analysis, or sequence generation, the preferred encoding for the text here is at the character level rather than the word level.
For example,
Christian Douglas,Chris,1
Jhon Stevens,Charlie,0
would become
[C,h,r,i,s,t,i,a,n, ,D,o,u,g,l,a,s, ,C,h,r,i,s] --> [1]
[J,h,o,n, ,S,t,e,v,e,n,s, ,C,h,a,r,l,i,e] --> [0]
The two strings to be matched are concatenated into a single sequence. The intuition is that the RNN processes the sequence character by character and learns (read: updates its weights) that the characters at the end follow a pattern similar to what it saw earlier in the same sequence, and so deduces that the label should be 1 instead of 0. The vector of [1/0] values is the target variable.
The standard RNN pre-processing steps apply as usual: we pad the sequences at the beginning so that they're all the same length (say 50), encode the characters as integers instead of strings, etc.
Since the dictionary here is pretty small (26 letters + space + pad), the network architecture can be fairly simple. A single embedding layer plus a recurrent layer should suffice.
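A minimal sketch of such a model in Keras (the sequence length, layer sizes, and training settings here are illustrative assumptions, not tuned values):

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # Character vocabulary: a-z plus space; index 0 is reserved for padding.
    chars = "abcdefghijklmnopqrstuvwxyz "
    char_to_idx = {c: i + 1 for i, c in enumerate(chars)}
    MAX_LEN = 50  # assumed maximum length of the concatenated pair

    def encode(full_name, nickname):
        # Concatenate the pair into one sequence and map characters to integer ids.
        seq = f"{full_name} {nickname}".lower()
        return [char_to_idx[c] for c in seq if c in char_to_idx]

    pairs = [("Christian Douglas", "Chris", 1), ("Jhon Stevens", "Charlie", 0)]
    # pad_sequences pads at the beginning by default, as described above.
    X = pad_sequences([encode(f, n) for f, n, _ in pairs], maxlen=MAX_LEN)
    y = np.array([m for _, _, m in pairs])

    model = Sequential([
        Embedding(input_dim=len(chars) + 1, output_dim=16, mask_zero=True),
        LSTM(32),
        Dense(1, activation="sigmoid"),  # probability of match
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=3, batch_size=32)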
Framing the problem in this manner lets us use a vanilla RNN or an out-of-the-box LSTM/GRU instead of creating a custom architecture that takes two separate strings as input for each data point and outputs a number.
You could give this approach a shot and see if it is able to satisfactorily beat the baseline models.
A good read on character-level RNNs is Andrej Karpathy's blog post and code. The problem he's trying to solve is different and the code is in pure numpy, but it still captures the idea pretty well.
answered yesterday by Adarsh Chavakula
I had a similar problem in my last job. My solution was to build features via (transformation(s) + comparison) in many combinations, feed them to models, then aggregate and model again, i.e. a two-layer model. The key is using encodings and similarity scores as features.
Transforms: remove vowels (great for certain roots), remove end vowels, remove double characters, convert to a phonetic string (IPA, Soundex, https://pypi.org/project/Fuzzy/), replace characters that either sound similar or have different sounds in other languages (J in Eastern Europe sounds like Y in the US; C can sound like K; D ~ T; T ~ TH; etc.). The strategy is to handle the many kinds of weirdness and irregularity in people's names.
Comparisons (similarity and difference): try similarity and difference scores at the character level, the block/root/prefix/suffix level, and the word level (which may not apply to you). Try Dice's coefficient, Levenshtein, Needleman–Wunsch, longest common (non)contiguous substring, character-histogram similarity, number of characters matching and not matching (each left and right), etc. You could also try an RNN/LSTM and have it learn a similarity for each transform, then use the output of the trained model(s) as another feature.
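To illustrate the transform + comparison idea, here is a small feature-building sketch using the jellyfish library linked below; the specific feature names, the choice of comparing against the first name only, and the vowel-removal rule are assumptions for the example, not the exact pipeline described:

    import os
    import jellyfish  # https://github.com/jamesturk/jellyfish

    def remove_vowels(s):
        # Transform: drop vowels after the first character.
        return s[:1] + "".join(c for c in s[1:] if c not in "aeiou")

    def pair_features(full_name, nickname):
        # Build (transform + comparison) features for one name/nickname pair.
        first = full_name.split()[0].lower()
        nick = nickname.lower()
        return {
            "lev_raw": jellyfish.levenshtein_distance(first, nick),
            "lev_no_vowels": jellyfish.levenshtein_distance(remove_vowels(first),
                                                            remove_vowels(nick)),
            "same_soundex": int(jellyfish.soundex(first) == jellyfish.soundex(nick)),
            "same_metaphone": int(jellyfish.metaphone(first) == jellyfish.metaphone(nick)),
            "common_prefix_len": len(os.path.commonprefix([first, nick])),
        }

    print(pair_features("Christian Douglas", "Chris"))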
Experiment with different combinations of the above and select a few that seem to have value. You could simply take all the scores and fit a logistic regression (or a neural net), or you could build statistical models and output a percent rank, based on a small training set, to normalize them. Another way to preprocess the raw scores is calibration encoding via a logistic function. Then add summary statistics of the normalized scores as additional features, and push all of this into the final model.
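A minimal sketch of the "feed all the scores to logistic regression" step, building on the hypothetical pair_features helper from the sketch above:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    pairs = [("Christian Douglas", "Chris", 1), ("Jhon Stevens", "Charlie", 0),
             ("John Williams", "Willy", 1), ("Lara Williams", "Ana", 0)]
    X = DictVectorizer(sparse=False).fit_transform(
        [pair_features(full, nick) for full, nick, _ in pairs])
    y = [match for _, _, match in pairs]

    # class_weight="balanced" is one way to compensate for the ~17%/83%
    # label skew mentioned in the comments.
    clf = LogisticRegression(class_weight="balanced").fit(X, y)
    print(clf.predict_proba(X)[:, 1])  # probability of match per pair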
Will you handle names derived from Arabic, Spanish, French, etc.? This is just extra, but consider downloading the Social Security and US Census name statistics to enhance your project with more name variations. I'll leave the how to you, but it helps to know the likely possibilities. Be aware that Levenshtein alone doesn't work well for William -> Bill, Dianne -> Di, Larry -> Lawrence, Mohammed -> Muhamed and Hamed, Danielle -> Daniela, Thomas -> Tom, or Jimmy -> James. The strategy I described should help with all that variation.
Additional resources to explore:
https://github.com/jamesturk/jellyfish
https://nameberry.com/list/276/If-You-Like-Danielle-You-Might-Love
https://pypi.org/project/phonetics/
answered 12 hours ago by ldmtwo (new contributor); edited 10 hours ago by Siong Thye Goh